Studies in Supervisory Evaluation


Quentin W. File and H. H. Remmers 
Purdue University 


As management starts examining the general quality of the super- 
vision which was obtained during the war, it becomes increasingly ap- 
parent that improved selection technics would pay their costs many times 
over. Good “bosses” do not just happen. Unfortunately too, poor ones 
cannot be completely rehabilitated by intensive company training pro- 
grams. Interviews, though an absolute necessity, cannot, when used 
alone, provide complete objective evaluations of potential supervisors’ 
qualifications. 

The purpose of this article is twofold: First, to compare the findings of 
Sartain’s study,¹ as reported in the August issue of this Journal, with the
findings reported in the study by File² which resulted in the development
of the test, How supervise?, and second, to report more recent evidence of 
the validity of this test. 

Briefly summarizing some of the points reported by Sartain, it would 
seem that in his study: 


1. Rather reliable ratings of the abilities of the supervisors studied 
were obtained. 

2. All the standardized tests considered measured factors other than 
those included in the ratings obtained on these supervisors. 

3. Mental ability as measured by Tiffin and Lawshe’s Adaptability 
test is in that particular plant negatively related to the attitudes and 
understandings set forth by the test, How supervise?, as being necessary 
for supervisory success. 


Before raising questions as to the general applicability of these
findings, it should be pointed out that Dr. Sartain was very careful to
emphasize that his data were obtained in a single war plant. He likewise
mentions the possibility that his criterion may possess inherent
weaknesses. The following observations, therefore, should be considered,
not as a criticism of his study, but as an attempt to evaluate the
possibilities for generalization which obtain.


¹ Sartain, A. Q. Relation between scores on certain standard tests and supervisory
success in an aircraft factory. J. appl. Psychol., 1946, 30, No. 4.
² File, Quentin W. The measurement of supervisory quality in industry. J. appl.
Psychol., 1945, 29, 323-337.


1. In the last stages of World War II, the aviation industry was prob- 
ably much less stable than the average established manufacturing organi- 
zation. Its payrolls contained a sizable proportion of workers drawn 
from other more permanent industries. Many of these workers by this 
time realized that loss of their jobs in the near future was a foregone con-
clusion. Its top management, too, may well have reflected the effects of
rapid expansion by failing to formulate adequate standards for super- 
visory performance. 

2. Management’s ratings may be reliable without being valid. Man- 
agement above the operating level must rely primarily on line organiza- 
tion channels for its information. Ratings by higher management may 
well be a reflection of the immediate supervisor’s opinions, thus producing 
spuriously high correlations. 

3. The reported correlations between management’s ratings and the 
ratings of job success were higher than the reliabilities of the ratings 
themselves (as estimated by the Spearman-Brown Prophecy Formula). 
Since this would indicate some systematic bias in the scores, it seems 
possible that all the criterion scores may have a common origin, namely 
the judgment of the individual giving the grades for job success. 
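
The reasoning in observation 3 rests on two standard psychometric relations: the Spearman-Brown formula for stepping up the reliability of pooled ratings, and the classical test theory expectation that a correlation between two fallible measures should not exceed the square root of the product of their reliabilities. The sketch below is only an illustration of that logic. The function names and the reliability figure of .60 are hypothetical and are not taken from Sartain's study.

```python
import math


def spearman_brown(r_single: float, k: float) -> float:
    """Estimated reliability of k pooled parallel ratings, given the
    reliability (average intercorrelation) of a single rating."""
    return k * r_single / (1.0 + (k - 1.0) * r_single)


def max_expected_correlation(rel_x: float, rel_y: float) -> float:
    """Classical test theory ceiling on the correlation between two
    fallible measures with reliabilities rel_x and rel_y."""
    return math.sqrt(rel_x * rel_y)


# Hypothetical figure for illustration only (not from the source).
single_rater_agreement = 0.60

# Reliability of the average of two raters.
pooled = spearman_brown(single_rater_agreement, k=2)

# Ceiling on the correlation between two criteria of that reliability.
ceiling = max_expected_correlation(pooled, pooled)

print(f"stepped-up reliability: {pooled:.2f}")
print(f"ceiling on correlation: {ceiling:.2f}")
```

An observed correlation well above such a ceiling is the kind of result that suggests the two sets of scores share a common source of error.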


Sartain and File agree that multiple ratings by line management do not 
correlate to any significant extent with scores on the experimental edition 
of How supervise? File in his study of some 577 supervisors in ten in- 
dustries also failed to find relationships above .15 between Management’s 
ratings and: (1), work experience; (2), education or supervisory training; 
(3), supervisory experience, and (4), stability of employment. 

One highly significant difference between Sartain’s study and previ- 
ously reported evidence is the relationship between general mental 
ability and scores on the experimental edition of How supervise? Sartain 
reports a correlation of -.44 (N = 40) where general mental ability was
measured by the Adaptability test. File, using a slightly different ap- 
proach, obtained a correlation of .35 (N = 577) between highest educa- 
tional level attained and scores on the supervisory ability test. Since 
there is a known positive relationship between general mental ability and 
scholastic achievement, the findings appear contradictory. The following
observations are submitted in support of File’s findings: (1) tests
requiring reading ability normally correlate .30 or more with
“intelligence” test scores; and (2) File’s study was based on over five
hundred supervisors in ten different industrial concerns, while Sartain’s
study included only forty supervisors in an expanded war industry. This
difference between the findings of the two studies may constitute reason
for questioning whether the supervisors in the sample studied by Sartain
are sufficiently typical to be used as cases for drawing general
conclusions concerning the usefulness of tests in selecting supervisory
personnel.
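
The weight given to sample size in this comparison can be made concrete with approximate confidence intervals for the two coefficients. The sketch below uses Fisher's z transformation, which is a technique brought in here for illustration and not one used in either study. The function name is hypothetical. Note also that the two coefficients involve different variables (Adaptability test scores in one case, educational level in the other), so the intervals show relative precision only.

```python
import math


def fisher_ci(r: float, n: int, z_crit: float = 1.96):
    """Approximate 95% confidence interval for a correlation r based on
    n cases, using Fisher's z transformation."""
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# Coefficients and sample sizes as reported in the text.
sartain_lo, sartain_hi = fisher_ci(-0.44, 40)
file_lo, file_hi = fisher_ci(0.35, 577)

print(f"Sartain (N = 40):  {sartain_lo:.2f} to {sartain_hi:.2f}")
print(f"File    (N = 577): {file_lo:.2f} to {file_hi:.2f}")
print(f"interval widths: {sartain_hi - sartain_lo:.2f} vs {file_hi - file_lo:.2f}")
```

The interval based on forty cases is several times wider than the one based on 577 cases, which is the sense in which the smaller study supports weaker generalization.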


Recent Evidence of the Validity of How supervise? 


Considerable evidence has now been obtained that the revised edition 
of How supervise? * does possess validity for the selection of industrial 
supervisors and a more comprehensive validation program is now in 
progress. Results reported by three industries may be summarized as 
follows: 


I. One form of the test was given to 46 successful supervisors and 14 
non-supervisors (by-passed because of judged lack of ability) in an office 
machine manufacturing company. Results of the investigation are as 
follows: 











                                                Successful     Non-
                                                Supervisors    Supervisors
Per cent Above 50th Percentile                      80             15
Per cent Below 50th Percentile                      20             85
[row label illegible]                               75             23

Critical Ratio of Difference Between Proportions
Above and Below 50th Percentile Point                      5.8





II. Excerpts from report by the Supervisor of Testing in a company 
which manufactures surgical supplies. 


“1. I have computed the reliability of the test on 50 cases and find that
its reliability is .85. A reliability of .85 means roughly that in 60 cases 
out of 100 a person will achieve approximately the same score when tested 
a second time. 

“The conventional odds-evens method of determining the reliability 
was used. 

“2. I have done some work on validity using an expedient method 
which is not fully acceptable, but it shows a positive trend which
encourages me to continue this work when more valuable criteria are
available. As a result of my evaluations, I selected either present
supervisors or those whom I had recommended as potential supervisors, and
considered these men successful supervisors. Since my evaluations had been
made independent of the How supervise test score, I felt that using this
as a standard I could select another group whom I had not recommended for
comparison. Using the same method, I selected 20 people who, in my
opinion, substantiated by test results, could not become supervisors at
. . . (Name of Company). I contrasted the two groups. The following
table will give you my findings:


* Published by The Psychological Corporation, 522 Fifth Avenue, New York City.











                                               N = 20            N = 20
                                               Superior Group    Inferior Group
Average Score                                      54                38
Standard Deviation                                  9.               11.
Standard Error of Mean                             2.01              2.45
Standard Error of Difference Between Means                 4.4





“The above brief table shows you that the average for the ‘super- 
visors’ is higher than the ‘non-supervisors.’ The standard deviation 
shows that the ‘supervisor’ group is more consistent than the ‘non-super-
visor’ group. The Standard Error of the mean shows that in 68 cases out
of 100, the average of the superior group will range between 56.01 and 
51.99 and that in 68 cases out of 100, the average for the inferior group 
will range between 40.55 and 35.45. The Standard Error of the difference 
between the two means is 4.4 which is sometimes called the critical ratio. 
If the critical ratio is 3, it is considered significant and means that in 100 
chances out of 100 the test is differentiating. 4.4 is additional assurance 
that it is really differentiating.” 


III. Report submitted by the General Manager of a relatively large 
laundry. Sixteen supervisors were given How supervise. These in- 
dividuals were divided into the following groups: 


Group I. “The Company has complete confidence in Group I to
handle all types of supervisory problems. Each individ- 
ual is rated by us as superior in this respect. 

Group II. “This group is doing an excellent job, but occasionally 
needs follow-up on practices. Delicate situations oc- 
casionally require assistance. 

Group III. “These individuals are new, and have only recently been 
assigned supervisory responsibilities. Preliminary evi-
dence would indicate success.

Group IV. “These usually give substandard supervisory performance. 

Group V. “Experience has proved that these people can be given 
only the most limited supervisory responsibilities. How- 
ever, both have other qualities so valuable to us that 
their services will be retained.” 
              Number      Average     Percentile
              in Each     Raw         Value of
              Group       Score       Average Score
Group I          6         54.7           96
Group II         3         49.7           91
Group III        3         44.7           79
Group IV         2         40             68
Group V          2         32             33
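
The critical ratios quoted for studies I and II can be checked against the figures in their tables. The sketch below is a rough check under stated assumptions, not a reconstruction of the computations actually performed: it uses the unpooled standard error of a difference between independent proportions (study I) and between independent means (study II), reads the standard deviations printed as "9." and "11." as 9 and 11, and uses hypothetical function names.

```python
import math


def critical_ratio_proportions(p1, n1, p2, n2):
    """Difference between two independent proportions divided by its
    unpooled standard error."""
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2) / se


def critical_ratio_means(m1, sd1, n1, m2, sd2, n2):
    """Difference between two independent means divided by the standard
    error of that difference. Also returns the intermediate values."""
    se1 = sd1 / math.sqrt(n1)          # standard error of the first mean
    se2 = sd2 / math.sqrt(n2)          # standard error of the second mean
    se_diff = math.sqrt(se1 ** 2 + se2 ** 2)
    return (m1 - m2) / se_diff, se1, se2, se_diff


# Study I: proportion above the 50th percentile,
# 46 supervisors versus 14 non-supervisors.
cr_prop = critical_ratio_proportions(0.80, 46, 0.15, 14)

# Study II: averages 54 and 38, N = 20 in each group.
# Standard deviations assumed to be 9 and 11.
cr_mean, se_sup, se_inf, se_diff = critical_ratio_means(54, 9, 20, 38, 11, 20)

print(f"study I critical ratio:   {cr_prop:.1f}")
print(f"study II standard errors: {se_sup:.2f}, {se_inf:.2f}")
print(f"study II SE of difference: {se_diff:.2f}")
print(f"study II critical ratio:  {cr_mean:.1f}")
```

Under these assumptions the study I ratio agrees with the reported 5.8, and the standard errors of the means in study II agree with the table to within rounding. The study II ratio comes out near 5.0 and the standard error of the difference near 3.2, so the reported 4.4 was presumably obtained by a somewhat different formula, which the report does not state. The standard error of a difference and the critical ratio are in any case distinct quantities: the second is the difference divided by the first.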





Obviously these studies do not constitute conclusive evidence of the 
universal validity of How supervise? as a supervisory selection device. 
The numbers of cases in the studies were small. Selection of the criterion 
groups was made either by a personnel department man or by some unex- 
plained method. These criterion groups by definition constitute the 
extremes of the distribution and could be expected to show more startling 
differences than if all the supervisory personnel in those concerns were 
included. 

On the other hand, evidences of satisfactory discrimination were found 
in three widely different supervisory groups. Though the numbers of 
individuals considered in each sample were small, statistically significant 
differences were obtained. Considering all three studies as a unit, 116 
supervisors or potential supervisors were measured and, if statistically
combined, the computed discrimination ratio would be higher than those
reported for the individual studies. 

Summarizing briefly, the independent indications of the validity of 
How supervise? which have been obtained to date are: 


1. Significant increases in supervisory understanding have been 
measured by administering the test before and after supervisory training. 

2. As reported above, significant differences have been found between 
“successful” supervisors and individuals by-passed because of lack of 
supervisory ability. 

3. No company has reported evidence, either subjective or statistical,
that differences in supervisory ability are not measured by the revised
edition of the test. A considerable number of concerns have accepted
its validity on the basis of subjective evidence and report receiving
valuable results. 


With this background of strong preliminary evidence of validity, a 
more comprehensive investigation is now being undertaken. Provision 
is also being made for the construction of a new business and management 
form of the test. This edition is intended for use among management 
above the foreman level and among supervisors in business organizations 
as well as those in industrial concerns. 


Received June 24, 1946. 





Studies in Job Evaluation. 5. An Analysis of the Factor 
Comparison System as it Functions in a 
Paper Mill * 


C. H. Lawshe, Jr., and R. F. Wilson 
Division of Applied Psychology, Purdue University 


With the increasing acceptance of job evaluation as a wage negotiation 
and stabilization technique come several fundamental problems. What 
basic judgment factors determine the differential wages to be paid workers 
on various jobs? Can these factors be discovered and systematized into 
a list which will cover all or nearly all jobs, so that differential wage levels 
can be determined fairly and objectively? The purpose of this series of 
papers is to work toward an answer to these questions by analyzing some 
of the currently accepted job evaluation systems and their results as they 
function in various industrial situations. 

Factor analysis techniques in previous studies have yielded two factors 
which define themselves with striking similarity and consistency from 
plant to plant and sometimes a third which is related to the uniqueness 
of the type of plant or class of jobs being rated. The first factor correlates 
highly with mental and skill requirements and other job elements which 
connote these qualities and has been designated as “Skill Demands.” 
The second factor correlates highly with such elements as working con-
ditions and hazards and has been designated as “Job Characteristics.”

It has been found in the former studies that an abbreviated scale 
composed of selected items from the original scale will yield results cor- 
relating highly with the original scale ratings. The resultant displace- 
ment of jobs in terms of cents per hour, which would occur as a result of 
using the abbreviated scale, would be so small as to suggest strongly the 
advisability of simplifying job evaluation systems to attain administra- 
tive efficiency and lower cost in the operation of the job evaluation pro- 
gram. 


* This article is a “prior publication,” the author paying complete costs. The 
scheduled 80 pages per issue is thereby increased by the corresponding amount, thus the 
“early publication” of this article is a direct contribution to the subscribers of the Journal 
of Applied Psychology without handicap to those authors whose articles are accepted 
and printed in their regular turn. 
Purpose of This Study


The former studies have all dealt with point rating systems. It was 
felt that a study of the Factor Comparison System, as presented by 
Benge, Burk, and Hay (1), might yield particularly significant results
for two reasons. 

First, the Factor Comparison System, through its method of job to job 
comparison, should tend to minimize the “halo” effect. Under the point 
rating systems, the analyst, in considering a job description, then rating 
it on a series of scales, might tend to rate that job at the same level on 
several scales. For instance, the rater, having decided that a high degree 
of mental ability was required for a job, might tend to let that decision 
influence his rating on several subsequent scales, such as Responsibility, 
Versatility, Experience, etc. This would increase the “halo” effect, and 
would tend to give spuriously high correlations of the job elements with 
each other, which in turn would result in a smaller number of factors and 
distorted weighting of the job elements on those factors. 

Second, the Factor Comparison System might avoid to some degree 
the tendency of a rater to check many jobs at or near the same level on a 
given scale. If a rater tends to rate all jobs at or near the middle of an 
item scale, say Working Conditions, the effect is to eliminate any dis- 
crimination on the basis of that job element, producing the same effect 
as if that element had been left out of the scale altogether. 

On the basis of these considerations, this particular study promised to 
yield further insight into the nature of the basic job evaluation factors as 
identified in the judgment process. 


The Factor Comparison System 


Selection of Key Jobs. The Factor Comparison System of job evalu-
ation involves the comparison of jobs being rated with a scale of “key”
jobs rather than the evaluation of jobs against an a priori point scale.
Fifteen to twenty-five jobs, ranging from the highest paid jobs to the
lowest paid jobs in the plant, are selected by a job evaluation committee. 
These must be jobs which have clearly defined job descriptions and the 
rates of which are generally judged to be fair. 

Ranking of Key Jobs. Members of the committee, individually and 
collectively, rank these key jobs in order of difficulty on Mental Require- 
ments. Then they rank the key jobs again on Physical Requirements, 
and so on until the key jobs have been ranked on each of the following 
five job elements: Mental Requirements, Physical Requirements, Skill
Requirements, Working Conditions, and Responsibility. 

The Salary or Wage Breakdown. The wage or salary for each of the
key jobs is then analyzed by estimating the amount of it that is being
paid for each of the job elements. The following example will serve to
illustrate. Shop Clerk, Job Description No. 124: Mental Requirements,
$23; Physical Requirements, $9; Skill Requirements, $20; Working
Conditions, $16; and Responsibility, $32; making a total salary of $100.

Matching the Breakdown With the Rank. The amounts estimated by 
the committee as being paid for Mental Requirements on each of the key 
jobs are entered on a sheet opposite the final ranking, as illustrated by the 
following hypothetical example: 








                                   Estimated Amount
                                   Paid for Mental
Rank    Job Title                  Requirements
 1      Dept. Supervisor               $70
 2      Asst. Supervisor                47
 3      Machine Bookkeeper              31
 4      Card Punch Operator             21
 5      Mail Clerk                      25
 6      File Clerk                      18
 7      Messenger Boy                   11





When the matching is done, some of the amounts (as No. 5 above) may 
be out of line with the ranking. If the amount is only slightly out of line, 
the committee may decide to reprice the elements on that job to adjust 
this difficulty; otherwise that job is eliminated from the list of key jobs. 
This matching of amounts and rankings is then completed for each of the 
five scales in this manner, and only those jobs which fall in line are re- 
tained as key jobs. 
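
The matching step amounts to checking that the estimated amounts fall in the same order as the ranks. A minimal sketch of such a check follows, using the hypothetical Mental Requirements scale given above. The function name and the rule applied (flag a job whose amount exceeds that of the job ranked immediately above it) are assumptions made for illustration. The committee procedure described here is a matter of judgment, not a fixed formula.

```python
def out_of_line(key_jobs):
    """Return the ranks of jobs whose estimated amount exceeds the amount
    of the job ranked immediately above them.

    key_jobs: list of (rank, title, amount) tuples sorted by rank,
    with rank 1 the most demanding job.
    """
    flagged = []
    for (_, _, prev_amount), (rank, _, amount) in zip(key_jobs, key_jobs[1:]):
        if amount > prev_amount:
            flagged.append(rank)
    return flagged


# Hypothetical example from the text: amounts paid for Mental Requirements.
mental_scale = [
    (1, "Dept. Supervisor", 70),
    (2, "Asst. Supervisor", 47),
    (3, "Machine Bookkeeper", 31),
    (4, "Card Punch Operator", 21),
    (5, "Mail Clerk", 25),
    (6, "File Clerk", 18),
    (7, "Messenger Boy", 11),
]

flagged = out_of_line(mental_scale)
print(flagged)  # [5]
```

The check singles out No. 5, as in the text. Whether the remedy is to reprice that job or to drop it from the key list remains a decision for the committee.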

Rating the Bulk of the Jobs. At this point the job analyst has five 
“measuring sticks” with which to rate other jobs. Each level on the
“measuring sticks” is described by a representative job, and each level 
also has its “price tag.” In evaluating a different job, the analyst first 
compares that job with the key jobs on the Mental Requirements scale 
to determine which of the key jobs demands a similar degree of mental 
ability. Having selected a point on the scale at or near the most com- 
parable key job, the analyst assigns the indicated amount to be paid for 
Mental Requirements on the job being rated. By repeating this process 
on each of the five scales, the analyst has five amounts which added to- 
gether equal the indicated salary or wage for the job being rated. 


Procedure 


The cooperating industry, a paper mill, furnished Factor Comparison
job evaluation data on one hundred and seventy-six job classifications.
The cents-per-hour values obtained for the five job elements on each job
classification were punched on IBM (machine sort) cards. The constants
determining the intercorrelations were obtained from machine tabula-
tions, the correlations computed, and a correlation matrix prepared
(Table 1).

Table 1

Intercorrelation Coefficients Between Ratings on Five Factors and Total Points
for 176 Jobs in a Paper Mill

                              2.        3.        4.        5.        6.
                              Mental    Physical  Skill
                              Require-  Require-  Require-  Respon-
                              ments     ments     ments     sibility  Total
1. Working Conditions         -.09       .76      -.03      -.10       .23
2. Mental Requirements                  -.16       .92       .91       .92
3. Physical Requirements                          -.06      -.16       .16
4. Skill Requirements                                        .89       .94
5. Responsibility                                                      .90





Factor analysis of these intercorrelations by Thurstone’s centroid
method yielded two factors with job element loadings as shown in Table
2. The extraction process was discontinued when Thurstone’s phi test
was satisfied. These two factors account for virtually all of the vari-
ability in total points since the communality (h²) equals unity. Rotation
was accomplished by the graphical method described by Guilford (2). 
The Wherry-Doolittle shrinkage selection method was applied to the 
values in the correlation matrix, and three items were selected for an 
abbreviated scale. 
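
With two orthogonal factors, the communality of each item is the sum of its squared loadings, and an orthogonal rotation leaves it unchanged. The sketch below checks both properties against the loadings reported in Table 2. The variable names are hypothetical, and the tolerance allows for the loadings having been rounded to two decimals.

```python
# Each entry: (k1 before, k2 before, k1 after, k2 after, reported h2),
# transcribed from Table 2.
loadings = {
    "Working Conditions":    (0.37, -0.77, 0.06,  0.85, 0.73),
    "Mental Requirements":   (0.82,  0.54, 0.96, -0.20, 0.96),
    "Physical Requirements": (0.31, -0.78, 0.00,  0.84, 0.70),
    "Skill Requirements":    (0.87,  0.46, 0.98, -0.11, 0.97),
    "Responsibility":        (0.81,  0.53, 0.95, -0.20, 0.94),
    "Total Points":          (0.98,  0.22, 0.99,  0.16, 1.00),
}

max_gap = 0.0
for item, (b1, b2, a1, a2, h2_reported) in loadings.items():
    h2_before = b1 ** 2 + b2 ** 2   # communality from unrotated loadings
    h2_after = a1 ** 2 + a2 ** 2    # communality from rotated loadings
    gap = max(abs(h2_before - h2_reported), abs(h2_after - h2_reported))
    max_gap = max(max_gap, gap)
    print(f"{item:<22} before {h2_before:.2f}  after {h2_after:.2f}  "
          f"reported {h2_reported:.2f}")

print(f"largest discrepancy: {max_gap:.3f}")
```

All recomputed communalities agree with the reported values to within rounding, both before and after rotation.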


Identification of Factors 


Factor I. Inspection of Table 2 demonstrates a convincing affirma-
tion of the two basic factors found in the former studies. Factor I
clearly has to do with Mental Requirements, Skill Requirements, and
Responsibility, as demonstrated by the heavy loadings .96, .98 and .95.
Working Conditions and Physical Requirements have practically no
weighting on Factor I. The correlation matrix (Table 1) shows the high
correlation of these three job elements with each other. Evidently the
raters are judging only slightly different aspects of the same thing.
These three job elements correspond convincingly with the definition of
the “Skill Demands” factor developed in the previous studies.


Table 2
Factor Loadings for Five Scale Items and Total Points

                           Before Rotation         After Rotation
Scale Items                k₁      k₂      h²      k₁      k₂      h²
Working Conditions         .37    -.77     .73     .06    +.85     .73
Mental Requirements        .82     .54     .96     .96    -.20     .96
Physical Requirements      .31    -.78     .70     .00    +.84     .70
Skill Requirements         .87     .46     .97     .98    -.11     .97
Responsibility             .81     .53     .94     .95    -.20     .94
Total Points               .98     .22    1.00     .99    +.16    1.00


Factor II. Factor II quite clearly defines itself in terms of the other
two job elements, Working Conditions and Physical Requirements, and
this corresponds with the “Job Characteristics” factor obtained in the
previous studies. The small negative loadings which the other three
elements have with Factor II are reasonable to expect. The correlation
matrix (Table 1) further shows that Working Conditions and Physical
Requirements correlate much higher with each other than with any of
the other job elements.


Fig. 1. Graph showing proportion of the total variability attributable
to each of the two factors (1. Skill Demands, 98%; 2. Job Characteristics, 2%).

Proportion of Variability Accounted for by Each Factor. As in former
studies, Factor I accounts for an extremely high proportion of the final
variation of job rates. Since the partial correlation (k₁) of Total Points
with Factor I is .99, the coefficient of determination would be k₁², which
equals .98, and Factor I would account for 98% of the variability in
Total Points as shown in Figure 1. In like manner it is seen that Factor
II accounts for 2% of the variation in Total Points.
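
The arithmetic behind these percentages is simply the squaring of the rotated Total Points loadings from Table 2. A short sketch, with hypothetical variable names:

```python
# Rotated loadings of Total Points on the two factors (Table 2).
k1_total = 0.99
k2_total = 0.16

share_factor_1 = k1_total ** 2   # coefficient of determination, Factor I
share_factor_2 = k2_total ** 2   # coefficient of determination, Factor II

print(f"Factor I:  {share_factor_1:.4f}")
print(f"Factor II: {share_factor_2:.4f}")
print(f"sum:       {share_factor_1 + share_factor_2:.4f}")
```

The squared loadings are .9801 and .0256. Their sum slightly exceeds unity only because the loadings are rounded to two decimals, and the shares correspond to the 98% and 2% shown in Figure 1.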


Adequacy of an Abbreviated Scale 


Items Selected. Application of the Wherry-Doolittle shrinkage selec-
tion process to the values in Table 1 using total points as the criterion
selected three of the job elements for an abbreviated scale. Since Skill 
Requirements alone correlates .94 with Total Points, the procedure was 
followed to identify those items which would increase the correlation 
most. The process was discontinued after selection of the third element 
because the multiple correlation of Skill Requirements, Working Condi- 
tions, and Mental Requirements with the results of the original scale had 
already become slightly higher than .99. Addition of a fourth element 
would have increased the multiple correlation by less than .008. The 
multiple correlation results are shown in Table 3. 











Table 3
Multiple Correlation Coefficient Between Groups of Items and Total Points

Items                                            R
Skill Requirement Alone                         .94
Skill plus Working Cond.                        .97
Skill plus Working Cond. plus Mental Req.       .99
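
The multiple correlations in Table 3 can be recovered, to rounding, from the intercorrelations in Table 1 by solving the normal equations for the standardized regression weights. The sketch below does only that. It is not the Wherry-Doolittle procedure itself, which selects items stepwise and applies a shrinkage correction, and the helper names are hypothetical.

```python
import math

# Correlations from Table 1.
# S = Skill Requirements, W = Working Conditions, M = Mental Requirements.
R_PRED = {("S", "W"): -0.03, ("S", "M"): 0.92, ("W", "M"): -0.09}
R_CRIT = {"S": 0.94, "W": 0.23, "M": 0.92}   # correlations with Total Points


def corr(a, b):
    """Look up the correlation between two predictors."""
    if a == b:
        return 1.0
    return R_PRED.get((a, b), R_PRED.get((b, a)))


def solve(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination
    with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def multiple_r(items):
    """Unshrunken multiple correlation of the given items with Total Points."""
    matrix = [[corr(a, b) for b in items] for a in items]
    rhs = [R_CRIT[a] for a in items]
    betas = solve(matrix, rhs)
    return math.sqrt(sum(b * r for b, r in zip(betas, rhs)))


r_one = multiple_r(["S"])
r_two = multiple_r(["S", "W"])
r_three = multiple_r(["S", "W", "M"])
print(f"{r_one:.2f}  {r_two:.2f}  {r_three:.2f}")
```

The three values round to .94, .97 and .99, in agreement with Table 3.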





Accuracy of the Abbreviated Scale. The wage administrator, however, 
is less interested in abstract correlation figures, and more interested in the 
number of cents per hour by which the various jobs would be displaced as 
a result of using the abbreviated scale. Figure 2 shows the “scatter”
obtained by plotting the wages, as calculated from the abbreviated scale, 
against the job wages obtained originally with the complete scale. Still 
further analysis of the data, as shown in Table 4, reveals that 166 out of 
the 176 jobs would be displaced by four cents or less, and that no job 
would be displaced by more than 8 cents. The average difference in 
wage rate is 1.7 cents. 
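
The summary figures quoted in this paragraph follow directly from the frequency distribution in Table 4. A brief sketch of the tally, with hypothetical variable names:

```python
# Difference in cents per hour -> number of jobs, from Table 4.
frequencies = {0: 32, 1: 66, 2: 41, 3: 16, 4: 11, 5: 4, 6: 2, 7: 3, 8: 1}

total_jobs = sum(frequencies.values())

# Jobs displaced by four cents or less.
within_four = sum(f for d, f in frequencies.items() if d <= 4)

# Average displacement in cents per hour.
mean_diff = sum(d * f for d, f in frequencies.items()) / total_jobs

# Running totals, as in the cumulative frequency column.
cumulative = []
running = 0
for d in sorted(frequencies):
    running += frequencies[d]
    cumulative.append(running)

print(total_jobs, within_four, round(mean_diff, 1), max(frequencies))
print(cumulative)
```

The tally gives 176 jobs in all, 166 displaced by four cents or less, a largest displacement of 8 cents, and an average of 1.7 cents.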


Fig. 2. Graph showing wages computed by the abbreviated three-element scale plotted
against wages computed by using the original five-element scale. (Vertical axis:
hourly wages, three elements; horizontal axis: hourly wages, five elements, marked
from 40 to 125.)


Table 4

Differentials in Cents per Hour Between Rates Computed with Five Factors and
Rates Compiled from Three Factors

Difference                 Frequency      Cumulative     Percentile
in Cents      Frequency    Percentage     Frequency      Values
    0            32            18             32             18
    1            66            38             98             56
    2            41            23            139             80
    3            16             9            155             88
    4            11             6            166             94
    5             4             2            170             96
    6             2             1            172             98
    7             3             2            175             99
    8             1             1            176            100


Discussion


In regard to the advisability of using an abbreviated number of
elements in the job evaluation system as opposed to use of the system in
its original and complete form, there is another important consideration
beyond merely demonstrating that an abbreviated scale will yield results
closely comparable to the original system. No job evaluation system
yet devised seems to have attained perfection, and thus these abbreviated
systems are being compared with a fallible criterion. Bearing in mind
that the criterion is fallible, and that the simplified or abbreviated
systems do yield closely comparable results, the question arises: are the
ratings obtained through the use of the abbreviated system perhaps as
good as, or better than, the ratings obtained from the use of the original
system? For instance, if all of these job evaluation files were to be
withheld from the members of this job evaluation committee, and they
were to go through the complete job rating procedure again, would it not
be possible that rates on the various jobs would vary from the original
values more than the abbreviated system results vary from the original?
Chances are great that they would vary more, for the results of the
abbreviated scale correlated .99 with the rates obtained from the original
scale. Probability is that the reliability of the original system is not that
high. The pertinent question then becomes: Are not the abbreviated
scale ratings as good as or better than the original scale ratings?

Since there is no theoretically perfect criterion of job evaluation, it 
seems that the question, “Which ratings are the best?”, would have to
remain unanswered. However, it is well to bear in mind that job evalu- 
ation as an industrial-wage administration technique does not eliminate 
the chance and error inherent in human judgment, but merely attempts 
to set a framework in which these human judgments may work more 
systematically and reliably. Reliability, then, may be an extremely 
important aspect of job evaluation and should be subjected to systematic 
investigation. 


Summary and Conclusions 


Factor Comparison System job evaluation data from a paper mill 
were subjected to the Thurstone Factor Analysis technique following the 
intercorrelation of points awarded on each of the job elements and the 
total. Rotation was accomplished by the graphical method. The 
Wherry-Doolittle shrinkage selection method was used to select three of 
the job elements for an abbreviated scale. The following findings are 
supported: 


1. The Factor Comparison Job evaluation system, which through its 
mechanics should tend to force the rater to make five separate and distinct 
ratings of the jobs, actually effected judgments on two principal axes in 
this industrial situation. 

2. The first principal axis, or factor, has a heavy loading in mental
and skill requirements (and other job elements, such as responsibility,
which connote these qualities). This factor, called “Skill Demands,” is
responsible for 98% of the final variation in job rates.

3. The second principal axis, or factor, has heavy loadings in Physical 
Requirements and Working Conditions. This factor, called “Job Char-
acteristics,” accounts for only 2% of the final variation in job rates.

4. Application of the Wherry-Doolittle shrinkage selection method 
selected three of the five job elements for an abbreviated scale. These 
three elements, Skill Requirements, Working Conditions and Mental 
Requirements, when combined correlate .99 with the original scale. 

5. While the reliability is not known, it is probable that the correla- 
tion between the original and the abbreviated scale is as high as could 
be obtained with the existing reliability and that the abbreviated scale
can be considered as valid and as usable as the original scale. 


Received June 10, 1946. 
The Learning Curve for Flying an Airplane *
W. N. Kellogg 


Indiana University 


The object of the present investigation was to examine the process of 
learning to fly. It was directed specifically towards the plotting of learn- 
ing curves and the study of the manner in which the student-pilot elim- 
inates his incorrect or erroneous responses as he masters the flying tech- 
nique. 


Procedure 


Apparatus. In order to obtain objective records, a special mechanism, 
known as the pilot-response recorder, was developed and installed in a 
Piper Cub Trainer. This device is illustrated in Figure 1. The pilot- 
response recorder weighs about 10 pounds and makes automatic graphic 
tracings of the extent and duration of the rudder, aileron, and elevator 
movements while the plane is in flight. By means of a system of cams or 
wedges (W, Fig. 1) the absolute extent of the airplane control movements 
is transmitted to the clockwork polygraph of the pilot-response recorder 
in direct linear proportion. The writing pointers are mounted on sleeves 
and move in a straight line across the paper. Errors which are common in
similar devices, such as the distortion introduced by the arcs of writing
levers which are pivoted at a fulcrum, errors of changing air pressure 
within pneumatic systems, or the variation in the elasticity of tambours 
at different tensions, were eliminated by this method. The entire ap- 
paratus was mounted in a concealed position behind the cockpit. It was 
therefore possible to keep the student-pilot from knowing that records of 
his flying were being made at all. 

Sample records made by this device are reproduced in Figure 2. The 
lines show movements of the rudder, elevator, and ailerons which were 
traced during the process of making landings. The first ground contact 
in each instance is indicated by the vertical broken line, so that, except 
for subsequent bumps, the portion of each tracing to the right of the 


* The investigation was financed by the Civil Aeronautics Authority through the 
Committee on Selection and Training of Civilian Pilots of the National Research Coun- 
cil. The data for this study were obtained in 1939 and 1940, but publication was neces- 
sarily withheld until after the termination of the war. 

1 The pilot-response recorder has been patented by Indiana University under the 
name of the airplane multiple control recorder. 
broken line represents taxiing. Time intervals shown on the bottom
horizontal line are 10 seconds in length. 

Types of Records Made. A standard course, which required about 10 
minutes to fly, was laid out with fixed pylons on the ground. The course
included four left turns and three right turns. Pilot-response records for 
flying the course with records of the corresponding landings and take-offs 
were made by both student and instructor after approximately every 30 














Fig. 1. The pilot-response recorder is a light-weight polygraph by means of which
the movements of the airplane controls can be graphically traced. A patented system
of cams or wedges (W) transmits the control movements to the paper in linear proportion
to their absolute extent. Errors which might be introduced by the arcs of writing levers
which are pivoted at a fulcrum, by pneumatic systems, or by the variable tensions of
tambour diaphragms are eliminated by this construction.


minutes of flight instruction. Periodic records were also taken of steep 
and shallow figure-eights and of 360 degree glides-to-a-landing. 

The Weather-control Technique. The object of having the instructor 
make flying records along with the student was to obtain some kind of a 
base or standard with which to compare the student’s performance. This 
base could not be a fixed one, but would be constantly changed or modi- 
fied by variable weather conditions. To cancel out this possible source 
of error, the instructor made the same maneuvers as the student, either
immediately before or immediately after the student had made them. 
Since the student’s and the instructor’s records were obtained but a few 
moments apart, over the same terrain, the difference between them could 
be regarded as a difference between the skill of the expert or finished pilot 
and the performance of the beginner. 

Every student record, therefore, had paired with it the corresponding 
record made by the instructor under the same flying conditions. To find 

Fig. 2. The irregular lines show movements of the rudder, elevator, and ailerons
which were recorded during the process of making landings. First ground contact in 
each instance is indicated by the vertical broken line. The tracings to the right of the 
broken lines, therefore, represent taxiing. Time intervals on the horizontal line at the 
bottom are 10 sec. in length. 


what a student’s errors were one compared the objective record of his 
flight with the appropriate control record made by the instructor. This 
method has been called the weather-control technique. 

Quantifying the Data. The graphic records made by the pilot-response 
recorder were measured by means of a special device known as a graphom- 
eter, which automatically totals the vertical deflections or oscillations 
from the horizontal of any irregular or wavy line. Readings from the 


*W. N. Kellogg. A device for measuring kymographic records. J. exp. Psychol.,
1936, 19, 383-385. 
graphometer converted to numerical form the total amount of movement
of each of the airplane controls within any given time period. By com- 
paring the graphometer readings of the student and the instructor it was 
possible to tell at once which person moved any given control more or 
less than the other person, and exactly how much more or less he moved it. 
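
The totalling performed by the graphometer can be pictured numerically. The following is only a rough sketch, assuming that a tracing has been digitized as a series of pen heights read at equal time intervals; the function name and the sample readings are hypothetical and are not taken from the study.

```python
def total_deflection(trace):
    """Sum the absolute vertical changes between successive samples.

    This stands in for the graphometer reading: the total amount of
    up-and-down movement in the line, regardless of direction.
    """
    return sum(abs(b - a) for a, b in zip(trace, trace[1:]))


# Hypothetical digitized tracings for the same maneuver (arbitrary units).
student_trace = [0, 4, -3, 5, -2, 0]
instructor_trace = [0, 1, -1, 1, 0, 0]

print(total_deflection(student_trace))     # 28
print(total_deflection(instructor_trace))  # 6
```

With readings of this kind, the larger total identifies the person who moved the control more within the period, and the difference shows by how much.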


Results 


The results in this report cover the training of two student-pilots. 
Presented below are a few selected items which seem to offer the most 
promise for the analysis of the learning process. 

Course Records. In Figure 3 is shown the learning curve plotted from 
graphometer readings of the elevator movements of student-pilot C, 











Fig. 3. Learning curve plotted from elevator movements of subject C, showing the
gradual elimination of overcontrolling with practice in flying over a standard course.
Correct manipulation of the controls is represented by the horizontal line.
(Horizontal axis: test flights.)


during his flights over the standard ground course. The points plotted 
are all ratios of the amount of elevator movement made by the student 
(S) divided by the amount of elevator movement made by the instructor 
(I). The curve includes 17 test flights or, roughly, 12 hours of instruction 
(17 half-hour periods plus 17 ten-minute periods of course flying). 
Student C made his first solo flight between test flight numbers 12 and 13. 

Since the points on the graph are all ratios, one can tell at once that 
student-pilot C began his flying by moving the elevator about five times 
as much as the instructor moved it. He was therefore overcontrolling 
very badly. A ratio of 1.0 (indicated by the broken horizontal line) would 
mean that the student moved the elevator the same amount that the 
instructor did within the same time period. It will be seen from Figure 
3 that student C gradually eliminated his elevator overcorrections so that,
after the eighth test flight, he was not far from the instructor’s perform- 
ance. 
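
The ratio plotted in Figure 3 can be illustrated with a short computation. This is a sketch only: the readings below are invented for the example and are not the data of student-pilot C, and the function names are hypothetical.

```python
def control_ratios(student_readings, instructor_readings):
    """Return the S/I ratio for each paired test flight."""
    if len(student_readings) != len(instructor_readings):
        raise ValueError("each student record needs a paired instructor record")
    return [s / i for s, i in zip(student_readings, instructor_readings)]


def describe(ratio):
    """Interpret a ratio relative to the instructor's performance (1.0)."""
    if ratio > 1.0:
        return "overcontrolling"
    if ratio < 1.0:
        return "undercontrolling"
    return "same as instructor"


# Hypothetical graphometer readings for four successive test flights.
student = [50, 30, 20, 12]
instructor = [10, 10, 10, 10]

for flight, ratio in enumerate(control_ratios(student, instructor), start=1):
    print(flight, ratio, describe(ratio))
```

Because each student reading is divided by the instructor reading taken under the same conditions, a curve approaching 1.0 shows the student approaching the instructor's amount of control movement.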

In Figure 4 is shown a similar curve for student-pilot C, but one which 
is a composite or combination of the movements of all three of the airplane 
controls. It appears from this learning curve that the student-pilot on 
the whole moved the controls less than the instructor moved them. This 
is indicated by the fact that the level of the curve is most of the time below 
the ratio of 1.0. Comparing the first part of the learning curve in Figure 
4 with the first part of the curve in Figure 3 one may infer that since sub- 
ject C overcontrolled so much with the elevator he must have undercor- 
rected with the other controls. As a matter of fact, this individual was 
much too limited in his rudder movements, as the graph of the course 
records for the rudder (not presented here) demonstrated. 


Fig. 4. Learning curve showing reduction in the manipulation of all three controls
as compared to correct or ideal use of controls which is indicated by horizontal
line. (Vertical axis: ratio S/I; the first solo flight is marked on the curve.)

Records of Landings. One of the most difficult maneuvers which the 
new pilot has to perfect is the maneuver of landing. It is, moreover, a 
maneuver from which many records can be easily obtained and one which 
must remain highly practiced with the pilot as long as he flies an airplane. 
It should be clear also that in the maneuver of landing the elevator plays 
by far the most important part. A good landing is actually made only 
with the elevator and throttle (unless flaps are used). The rudder and 
ailerons should not be employed except in the approach to the field and 
in correcting for bumps in rough air. 

The perfect landing is one in which the stick is gradually drawn back- 
wards (the tail lowered) as the plane loses speed in its landing glide. In 
the case of a three-point landing the stick should be all the way back at 
the moment the tail and landing wheels make contact with the ground 
(see Fig. 2). Poor landings are those in which there is too much forward
movement of the stick. The student “pumps” the stick back and forth
as he tries to “find the ground.” Improvement in landings should therefore
be shown by the reduction in forward stick-movements with practice.

In order to get at this problem, pilot-response elevator records were 
measured for a period of 15 seconds as the plane came into a landing. “A
landing” was arbitrarily defined by this means as the 15 seconds of flying
time which ended with ground contact. The learning curve plotted from 
such measurements, combined from the graphometer readings of the 
elevator movements of two subjects (C and P), is shown in Figure 5. 
Each point on the graph is a ratio of forward movements (F) divided by 

Fig. 5. Showing improvement in the use of the elevator during landings only.
Composite learning curve for two subjects. 


backward movements (B) of the stick—combined for two student pilots. 
When the ratio is high (1.0 or 1.5) it means that the subject is pushing 
forward too much on the stick during his landings. When the ratio is 
low it means that he is making few forward movements and that the 
landings are therefore “good.” 
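
The forward-to-backward ratio may be sketched as follows, assuming that the 15-second elevator tracing has been digitized as successive stick positions in which a larger number means the stick is farther back. That sign convention, the function name, and the sample values are assumptions made for illustration only.

```python
def forward_backward_ratio(stick_positions):
    """Ratio of total forward movement (F) to total backward movement (B).

    Assumes larger values mean the stick is farther back, so a decrease
    between samples is a forward movement.
    """
    forward = 0.0
    backward = 0.0
    for a, b in zip(stick_positions, stick_positions[1:]):
        change = b - a
        if change > 0:
            backward += change
        else:
            forward += -change
    if backward == 0:
        raise ValueError("no backward movement recorded; ratio is undefined")
    return forward / backward


# Hypothetical landing: mostly a steady pull back with two small pushes forward.
print(forward_backward_ratio([0, 2, 1, 4, 3, 6]))  # 0.25
```

A landing in which the stick is drawn back smoothly gives a ratio near zero, while repeated "pumping" raises the ratio toward 1.0 or above.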

From an examination of Figure 5 it appears that there is a rapid 
improvement in landing skill for the first few hours of instruction, and 
that thereafter the progress is slow—as in the mastery of any difficult 
skill. 


Conclusions 


The following propositions seem to be justified by the limited data of 
this study. 

1. The objective analysis of airplane control movements can be used
to show progress in the development of flying skill. 

2. By means of the pilot-response recorder and the graphometer, the 
psychologist can tell which controls the pilot is manipulating incorrectly, 
in which direction his errors occur, and how great they are. 

3. The weather-control technique seems to be adequate to cancel out 
variations in flying conditions. 

4. Learning curves for various maneuvers in flying are essentially the 
same as those obtained in the development of other skills. 

5. There is no evidence of plateaus in the curves of learning to fly, as 
plotted from the present data. 


Received October 18, 1945. 








The Purdue Mechanical Adaptability Test * 


C. H. Lawshe, Jr., Irene A. Semanek, and Joseph Tiffin 
Division of Applied Psychology, Purdue University 


Personnel administrators are realizing more and more that modern 
personnel programs are strengthened and improved through the use of 
personnel tests. Although many tests are available for use in industry, 
there is still a definite need for tests designed specifically for that purpose. 
The Purdue Mechanical Adaptability Test¹ is a test consisting of 60 questions
about practical mechanical facts which are answered “yes,” “no,”
or “don’t know.” It was designed to measure “knack” in mechanical, 
electrical, and related activities by means of an evaluation of experience 
in these areas. 

In the process of standardizing and validating the test on industrial 
populations the personnel departments of eleven manufacturing concerns 
cooperated by administering this test to applicants for jobs of a mechani- 
cal nature and to employees already working on such jobs. In the case 
of employees presently on the job, success ratings were obtained from 
supervisors. 

The intent of the authors was to select items of high internal consist- 
ency and at the same time to select items relatively unrelated to mental 
ability. Both of these objectives are important since highly consistent 
items may also have a high relationship to intelligence. 


Construction and Standardization 


Original Construction. The initial step in the development of The 
Purdue Mechanical Adaptability Test consisted of the construction of 400 
items of practical information in seven work areas. A deliberate effort 
was made to have these items meet the following criteria: 


1. The item must deal with practical information obtainable from 
first hand contact and not with theoretical principles or concepts. 


*This article is a “prior publication,” the author paying complete costs. The 
scheduled 80 pages per issue is thereby increased by the corresponding amount, thus the 
“early publication” of this article is a direct contribution to the subscribers of the
Journal of Applied Psychology without handicap to those authors whose articles are 
accepted and printed in their regular turn. 

1 The Purdue Mechanical Adaptability Test by C. H. Lawshe, Jr., and Joseph Tiffin 
is copyrighted by the Purdue Research Foundation and is distributed by the Division 
of Applied Psychology, Purdue University, Lafayette, Indiana. 
2. The item must be as short as possible.
3. All words (except technical vocabulary) should fall within the 
ability of a normal eighth grade student. 


These 400 items were arranged into four test forms, ST, UV, WX, and 
YZ, which were administered in mimeographed form to 138, 110, 109 and 
122 male students respectively in grades 10, 11, and 12 in two comprehensive
high schools. The Adaptability Test,² a test of general mental ability,
was administered to these populations at the same time and the 30%
scoring highest and the 30% scoring the lowest were segregated. The
“Kelley technique” of item validation described by Lawshe³ was employed
and D-values (discrimination values) on The Mechanical Adaptability Test
items were computed using these mental ability groups as criterion groups.
Since it was desirable to discard items that were known to be highly related
to intelligence, arbitrary D-value limits of +.3 and −.3 were established
and all items having D-values outside of that range were discarded.
This left 207 items which were distributed among the four
preliminary forms as follows: ST, 53; UV, 51; WX, 45; and YZ, 58. 

Item Selection. Only these selected items were scored, and the high 
30% and low 30% of the population on each form were identified to be 
used as criterion groups for computing D-values against these total scores. 
By this process, only those items having a D-value of .8 or better were 
retained. One hundred items were then selected from these four pre- 
liminary forms, each item of which had a D-value of .8 against total 
scores on the selected items and none of which had a D-value computed 
against mental ability that deviated from zero by more than plus or 
minus .3. These 100 items constituted Form R and were arranged in 
approximate order of difficulty. This trial form, also mimeographed, 
was administered to 250 boys in the 10th, 11th, and 12th grades in a 
trade school and to 189 male college students, a substantial number of 
whom were engineering students. Members of this latter group were 
asked to make written criticisms of any of the items that in their judg- 
ment were in any way ambiguous. On the basis of these comments, 
minor adjustments in items were made and the revised test was printed 
as Form S. Further analysis involved items that were common to Form
R and to Form S of the test.

Revision. Form R and Form S of the Purdue Mechanical Adaptability
Test were then administered to men in industrial situations. Of the 462 


2 Tiffin, Joseph, and Lawshe, C. H., Jr. The Adaptability Test: a fifteen minute
mental alertness test for use in personnel allocation. J. appl. Psychol., 1943, 27, 152-163.

3 Lawshe, C. H., Jr. A nomograph for estimating the validity of test items. J. appl.
Psychol., 1942, 26, 848-849. 
cases used in the final selection of test items, 364 were industrial applicants
in a steel mill, and 98 were already on the job in the following types of 
plants: foundry, screw manufacturing, electrical supply manufacturing, 
and farm machinery manufacturing. The Adaptability Test designed to 
measure mental alertness was also administered to all but twenty-four of 
the cases. 

The “Kelley technique” of item validation already mentioned was 
again used for estimating the validity of individual test items. Since it 
was desirable to reduce the number of items by selecting those that 
tended to measure the same thing and hence were highly consistent and 
yet at the same time not highly related to mental ability, internal and 
external criteria for item selection were again used. 

The internal criterion was the total score on the 100 items of the test 
prior to the item analysis; the external criterion was the score obtained on 
The Adaptability Test. In computing internal consistency D-values, 
criterion groups consisted of the 30% making the highest scores on all 
hundred items and the 30% making the lowest scores on the same items. 
After D-values were computed, papers for those cases for which The 
Adaptability Test scores were available were isolated and criterion groups 
selected to include the 30% scoring highest and the 30% scoring lowest. 

A scattergram of internal consistency D-values vs. mental ability D- 
values was then plotted. All items which did not yield an internal con- 
sistency D-value of .5 or better were discarded. Since it was desirable to 
minimize the intelligence factor in the test, all items yielding a D-value 
of .7 or greater against The Adaptability Test were discarded, thus elimin- 
ating those items which tended to be associated most closely with mental 
ability. 
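
The two mechanical parts of this procedure, forming the criterion groups and applying the cut-offs, can be sketched as below. The D-values themselves were read from Lawshe's nomograph and are not computed here; they are simply taken as given. All names and sample numbers are hypothetical.

```python
def criterion_groups(total_scores, percent=30):
    """Return (high, low) lists of case indices: the top and bottom `percent`."""
    n = len(total_scores)
    k = n * percent // 100          # size of each criterion group, rounded down
    if k == 0:
        raise ValueError("too few cases to form criterion groups")
    order = sorted(range(n), key=lambda i: total_scores[i])
    return order[-k:], order[:k]


def proportion_correct(item_responses, group):
    """Share of a criterion group answering one item correctly (1 = correct)."""
    return sum(item_responses[i] for i in group) / len(group)


def keep_item(internal_d, external_d, min_internal=0.5, max_external=0.7):
    """Retain an item with internal D of .5 or better and external D below .7."""
    return internal_d >= min_internal and external_d < max_external


# Hypothetical total scores for ten cases and their answers to one item.
scores = [12, 30, 25, 8, 40, 22, 35, 15, 28, 18]
item = [1, 1, 1, 0, 1, 0, 1, 0, 1, 1]

high, low = criterion_groups(scores)
print(proportion_correct(item, high), proportion_correct(item, low))
print(keep_item(internal_d=0.6, external_d=0.2))
```

The proportions for the high and low groups are the quantities from which a discrimination value would then be estimated.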

This process yielded 60 items which were incorporated into Form A 
(Men) with the items arranged in order of increasing difficulty. Table 1 
shows the content areas of the questions incorporated in Form A and the 
number of items in each area. 

The method of scoring the earlier forms as well as Form A is as fol- 
lows: (1) the number of correct responses is counted; (2) that total is 
doubled; (3) to that is added the number of responses marked “Don’t 
Know.” This yields the raw score⁴ which may be converted to a per-


4 This raw score is based upon a modification of the standard correction for guessing
(R − W) where two choices exist. With R, W, and DK denoting the numbers of right,
wrong, and “Don’t Know” responses on the 60 items, the derivation is as follows:

$$\text{Score} = R - W + 60$$
$$W = 60 - R - DK$$
$$\text{Score} = R - (60 - R - DK) + 60 = 2R + DK$$

The modification has two advantages: (1) since 60 is added to each score, all negative
scores are eliminated, and (2) in combination with a scoring stencil, scoring is
simpler since no marks need be made on the test papers and wrongs need not be counted.
Table 1

Content Areas Incorporated in the Purdue Mechanical Adaptability
Test Form A (Men)

Area                        No. of Items
Woodwork and Finishing           10
Automobile                       17
Electricity and Radio            18
Machine Shop                      4
Plumbing                          4
Sheet Metal                       2
Miscellaneous                     5
Total                            60





Table 2

The Relationship of Test Scores on Form A of the Purdue Mechanical Adaptability
Test and Other Measures

Test or Measure                               N      r     σr
California Capacity: Non-language            25    .41    .13
California Capacity: Language                25    .12    .16
Bennett Test of Mechanical Comprehension     33    .71    .09
Minnesota Paper Formboard                    39    .18    .16
Age                                          40    .32    .14

















Table 3

Differences between Means for College Sub-groups and their Significances

Group                                             N       Mean         C.R.
Mechanical and Aeronautical Engineers             71    103.1 ± 1.4    6.3
Science, Pharmacy, and Physical Educ.            103     91.7 ± 1.2

Mechanical and Aeronautical Engineers             71    103.1 ± 1.4    3.8
Civil, Metallurgical, and Electrical Engineers    54     95.6 ± 1.5

Civil, Metallurgical, and Electrical Engineers    54     95.6 ± 1.5    2.1
Science, Pharmacy, and Physical Educ.            103     91.7 ± 1.2

All Purdue Students                              274     99.7 ± 0.7
Table 4

Percentile Norms for the Mechanical Adaptability Test, Form A

                            College Men*
             Industrial   “Non-          “Non-
             Men          engineering”   mechanical”   “Mechanical”   All
Percentile   (N = 1015)   (N = 103)      (N = 54)      (N = 71)       (N = 274)
   100          116          114            114           118            118
    95          108          109            111           115            114
    90          105          107            111           112            111
    80           99          102            108           110            108
    70           95           97            105           108            106
    60           90           93            101           106            103
    50           86           91             97           104             99
    40           81           89             94           102             96
    30           76           86             91            99             93
    20           72           82             87            94             89
    10           65           76             80            88             82
     5           61           72             76            83             77
     1           57           69             72            79             74

* College groups were constituted as follows: “Non-engineering” included Science,
Pharmacy, and Physical Education students; “non-mechanical” included Civil Engineers,
Electrical Engineers and Metallurgical Engineers; “Mechanical” included Mechanical
Engineers and Aeronautical Engineers; the “all” classification was a random
sample of Purdue University students.


centile equivalent by means of Table 4. The highest possible score on 
Form A is 120, which is obtained when all 60 items are correct. 
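
The scoring rule (twice the number right plus the number marked “Don’t Know”) can be written out in a few lines. The sketch below is illustrative only; the function names are hypothetical, and the second function is included merely to show that the rule agrees with the R − W + 60 form given in the footnote.

```python
N_ITEMS = 60  # number of items in Form A


def raw_score(right, dont_know, n_items=N_ITEMS):
    """Raw score as described in the text: 2R + DK."""
    wrong = n_items - right - dont_know
    if min(right, dont_know, wrong) < 0:
        raise ValueError("response counts are inconsistent with the item total")
    return 2 * right + dont_know


def corrected_for_guessing(right, wrong, n_items=N_ITEMS):
    """Equivalent form from the footnote: R - W + 60."""
    return right - wrong + n_items


# Example: 40 right, 10 marked "Don't Know", hence 10 wrong.
print(raw_score(40, 10))               # 90
print(corrected_for_guessing(40, 10))  # 90
print(raw_score(60, 0))                # 120, the highest possible score
```

Only the rights and the “Don’t Know” responses need to be counted, which is the practical advantage claimed for the modified formula.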


Validity 

The validity of the Purdue Mechanical Adaptability Test for industrial 
use in employee and trainee allocation is shown by the relationship be- 
tween scores on the test and employee success on the job as measured by 
supervisory ratings. The studies which follow involved employees who 
took either Form R or Form S of the test. However, in the results presented
here, only those items which now appear in Form A were scored.
The first three of the validity studies below involved cases not included 
in the primary group used in the item selection procedure. 

Ice Company Mechanics. The correlation⁵ between scores on The
Purdue Mechanical Adaptability Test and success on the job as measured
by supervisor’s rankings of fourteen experienced mechanics in an ice
company was .86 ± .07. The effectiveness of this test in identifying the
highest rated mechanics can be seen by examining the scattergram in 


5 All plus and minus error designations in this paper pertain to standard errors. 
Figure 1. The best mechanic was the highest scorer on The Mechanical
Adaptability Test, and the poorest one scored lowest. The other me- 
chanics tend to fall in a linear pattern. 

Further inspection of Figure 1 shows that if a score of 90 were the 
minimum acceptable score for this job, 80% of the less desirable me- 
chanics (rated 1 and 2) would be rejected and only 20% would be ac- 
cepted. Of the more desirable mechanics (rated 3 or higher), 89% would 
be accepted and only 11% would be eliminated. If the minimum ac- 
ceptable score were set at 95, all of the least desirable mechanics would 
fall below this score, as would 60% of the average (rated 3); but at the 
same time, all of the more desirable mechanics (rated 4 or 5) would fall 
at or above this critical score. This indicates that a mechanic scoring 
higher on The Mechanical Adaptability Test, other things being equal, 


Fig. 1. A scattergram of scores on Form A and job success ratings of fourteen
mechanics in an ice company. (Vertical axis: rating; horizontal axis: Mechanical
Adaptability Test score, 70 to 120.)


would on the average be more successful on this job than one scoring 
lower on the test. 

Time Study Men. Six time study men from a company manufacturing
musical instruments were ranked by the supervisor and The Purdue
Mechanical Adaptability Test was administered to them. The rank order 
correlation was found to be .75 ± .18. A scattergram of these cases is
shown in Figure 2. It will be noted that there is one inversion in the case 
of the time study man rated 4 and that otherwise there is a perfect cor- 
relation. Although there were only six cases in the study, it is interesting 
to note that the three highest scorers were also the three highest ranking 
employees in the group. 
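
A rank order correlation of this kind can be computed as in the sketch below. The article does not state which formula was used; the familiar Spearman formula for untied ranks is assumed here, and the ranks shown are hypothetical rather than those of the six time study men.

```python
def rank_order_correlation(ranks_a, ranks_b):
    """Spearman rank-order coefficient for two sets of untied ranks."""
    n = len(ranks_a)
    if n != len(ranks_b) or n < 2:
        raise ValueError("need two rank lists of equal length (at least 2 cases)")
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * sum_d_squared / (n * (n * n - 1))


# Hypothetical supervisor ranks and test-score ranks for six employees,
# with a single pair of adjacent ranks interchanged.
supervisor_ranks = [1, 2, 3, 4, 5, 6]
test_ranks = [1, 2, 3, 5, 4, 6]

print(round(rank_order_correlation(supervisor_ranks, test_ranks), 3))  # 0.943
```

With so few cases a single interchange of ranks moves the coefficient appreciably, which is one reason the standard error attached to such a value is large.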

Steel Mill Apprentices. Twelve apprentices in a steel mill took the 
test at the time of hiring. Figure 3 shows a plot of test scores and rank 
order ratings made by the supervisor after they had been on the job. 
This group included the following apprentices: four machine shop, two

Fig. 2. A scattergram of test scores and supervisory rankings of six time study men
in a musical instrument plant. 


masonry, and one each of electrical, maintenance, carpenter shop, black- 
smith, pipe shop helper, and electrical construction apprentices. The 
rank order correlation was found to be .39 ± .24. When the apprentices
rated one to six are, arbitrarily, put into the “high” group and the others
in “low,” 50% are in the “high” group to begin with. With a minimum
acceptable score of 70, the percentage of people rated high is increased
to 55%; a minimum score of 89 increases the percentage of “high” ratings
to 60%. While this relationship between scores is not statistically 
significant, in combination with the other studies reported here, it yields 
some indication of what can be expected. 

Foundry Workers. Data on another group of twelve men who were 
employed in a foundry were studied. Ratings were prepared by the 

Fig. 3. Test scores and rankings of twelve steel mill apprentices.

foreman after the employees had been working in the plant thirty days.
Three men were rated very good, eight as fair and good, and one as poor. 
An inspection of the data showed that if a critical score of 70 were used, 
the employee rated poor and two of the fair to good class would not be 
acceptable, yet all those rated very good scored above this critical point. 

Screw Manufacturing Plant. The test was administered to 46 ex- 
perienced machine operators in a screw manufacturing plant. The 
operators were rated 5.0 (lowest rating), 5.5, 6.0, and up to 19.5 (highest). 
Ratings and scores were plotted on a scattergram. Figure 4 shows that 
if mechanics rated 13.5 or better are considered “high,” the percentage of 
employees rated “high” increases as the minimum acceptable test score is
increased. The figure shows that with a minimum acceptable score of 
75, about 44% of those passing were “high” rated. With a minimum 

Fig. 4. Graph showing percentage of screw machine operators who are rated “high”
when successively higher critical test scores are employed. 


acceptable score of 85, 48% of those passing are “high,” and with a
critical score of 95, 64% are in the “high” rated group. Since only 37%
of the whole group were rated “high” to begin with, the proportion of “high”
rated people increases as the minimum acceptable score is raised.

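
The kind of tabulation behind Figure 4 can be sketched as follows: for each critical score, the percentage rated “high” is found among those who would pass. The scores and ratings below are hypothetical and are not the data of the 46 operators; the function name is likewise an assumption.

```python
def percent_high_among_passing(scores, rated_high, critical_score):
    """Percentage rated "high" among cases scoring at or above the critical score."""
    passing = [high for score, high in zip(scores, rated_high)
               if score >= critical_score]
    if not passing:
        raise ValueError("no cases reach the critical score")
    return 100 * sum(passing) / len(passing)


# Hypothetical test scores and "high"/"not high" ratings for eight employees.
scores = [70, 78, 82, 88, 90, 96, 99, 104]
rated_high = [False, False, True, False, False, True, False, True]

for cutoff in (0, 85, 95):
    print(cutoff, round(percent_high_among_passing(scores, rated_high, cutoff), 1))
```

In this invented set the percentage rises from 37.5 with no cut-off to 40.0 at a critical score of 85 and 66.7 at 95, the same pattern of increase that the figure reports for the operators.
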
Electrical Company Apprentices. A group of 40 trade apprentices 
from an electrical equipment manufacturing plant consisted of fourteen 
machinists, eleven toolmakers, seven diemakers, one foundryman, and 
seven miscellaneous electrical workers. The apprentices were rated C+, 
B−, B, B+, A−, and A (highest). Since there was a tendency for apprentices
in some of the fields to be consistently rated higher than apprentices
in other fields, the apprentices in each group were divided into
approximately the highest rated half and the lowest rated half. Those 
rated high in each of the trade areas were pooled in the computations. 
Figure 5 shows that 47% of the apprentices in all trade areas were rated 
“high.” However, with a critical score of 80, 50% are rated “high”; and
when higher critical scores are considered, the percentage of “high”
rated employees tends to increase. 

Validity for Guidance Use. The adequacy of differential group norms 
as an estimate of test validity has frequently been questioned. However, 
so marked were some of the student group differences obtained that the 
facts are recorded here for such value as they may have. Men students, 
274 in number, in the Elementary Psychology course at Purdue University 
were given Form A. These men, mostly underclassmen, were drawn 
from all curricula in the University. By combining students in certain 
related programs or curricula, three sub-groups each exceeding 50 in 
number were obtained. For example, those students enrolled in Science, 

Fig. 5. Graph showing percentage of trade apprentices in an electrical company who
are rated “high” when successively higher critical scores are employed. 


Pharmacy, and Physical Education were combined. Those students in 
specific engineering curricula were divided into the “mechanical” group
(Mechanical Engineering, Aeronautical Engineering, and Air Transportation)
and the “non-mechanical” group (Civil Engineering, Metallurgical
Engineering and Electrical Engineering). Table 3 presents the mean 
scores for each group as well as the critical ratios based upon the differ- 
ences between them. All obtained differences are in the expected direc- 
tion, the “mechanical” group being at the top and the non-engineering 
group at the bottom. Note that the critical ratios range from 2.1 to 6.3. 
These facts coupled with the industrial validity data just presented might 
be useful in evaluating the test for guidance purposes. 
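
A critical ratio of this kind is commonly the difference between two means divided by the standard error of that difference. The sketch below assumes that formula for independent groups; the article does not say exactly how its ratios were obtained, and the function name is hypothetical. The means and standard errors are those printed in Table 3.

```python
import math


def critical_ratio(mean_a, se_a, mean_b, se_b):
    """Difference between means divided by the standard error of the difference.

    Assumes independent groups, so the standard errors combine as the
    square root of the sum of their squares.
    """
    return (mean_a - mean_b) / math.sqrt(se_a ** 2 + se_b ** 2)


# Means and standard errors as printed in Table 3.
mechanical = (103.1, 1.4)
non_mechanical = (95.6, 1.5)
non_engineering = (91.7, 1.2)

print(round(critical_ratio(*mechanical, *non_engineering), 2))      # 6.18
print(round(critical_ratio(*mechanical, *non_mechanical), 2))       # 3.66
print(round(critical_ratio(*non_mechanical, *non_engineering), 2))  # 2.03
```

Recomputed from the rounded figures in the table, the ratios come out close to, though not identical with, the tabled values of 6.3, 3.8, and 2.1; the small differences may reflect rounding of the published means and standard errors or a slightly different formula.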


Test Characteristics


Reliability. Reliability of Form A was obtained by the split-half 
method with two population samples, an industrial group and a college 
group. The first sample (not the primary group used for item analysis) 
consisted of 487 men, all applicants for industrial jobs with an optical 
manufacturing plant. Scores for the odd numbered items yielded a 
coefficient of .84 ± .01 when correlated with scores on the even numbered
items (stepped up by Spearman-Brown formula). The same procedure 
was repeated with a new group of 201 college men and a corresponding 
coefficient of .80 ± .03 was obtained.
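
The Spearman-Brown step-up can be shown in a few lines. The formula is the standard one; the function names are hypothetical, and the half-test correlation printed at the end is only what the formula implies for a stepped-up value of .84, not a figure reported in the article.

```python
def spearman_brown(r_half, factor=2):
    """Estimated reliability of a test `factor` times as long as the part."""
    return factor * r_half / (1 + (factor - 1) * r_half)


def implied_half_correlation(r_full):
    """Odd-even correlation that would step up to `r_full` (factor of 2)."""
    return r_full / (2 - r_full)


print(round(spearman_brown(0.5), 4))             # 0.6667
print(round(implied_half_correlation(0.84), 3))  # 0.724
```

The correction is needed because the odd and even halves each contain only 30 items, and a 30-item score is less reliable than the full 60-item score.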

Relationship to Mental Ability. Since The Purdue Mechanical Adapt- 
ability Test was designed to predict success on mechanical jobs, one of the 
aims was to eliminate items closely identified with intelligence. One 
study of the intelligence factor was made with the 485 cases on which the 
split-half reliability was computed. Total scores on Form A were cor- 
related with scores on The Adaptability Test. A coefficient of correlation 
of .32 ± .04 was obtained. It should be kept in mind that this was not
the primary group and that the range of mental ability was quite great. 
Using a group of 173 college men, the correlation between The Adaptability
Test and The Purdue Mechanical Adaptability Test was found to be
.17 ± .07.

The raw scores on the Otis Self-Administering Higher Examination of 
twenty-five mechanics employed by a farm machinery manufacturing 
firm correlated .08 ± .20 with scores on Form A.

Language and non-language raw scores on The California Capacity 
Test were available on forty apprentices varying in training experience 
from six months to three and one-half years. The group was in training 
with an electrical manufacturing plant, and included machinists, tool- 
makers, die makers, and electrical testers. Table 2 shows the relation- 
ship of the language and non-language factors to The Purdue Mechanical 
Adaptability Test. An examination of the standard errors of r for the two 
parts of the test as shown in Table 2 indicates a considerably greater 
probability that there is a true correlation in the case of the non-verbal 
test. This is evidence that a certain communality exists between the 
Purdue Mechanical Adaptability Test and the non-verbal test of mental 
ability that does not exist in the case of the verbal mental ability test. 

Relationship with Other Measures. Other data available on the 
machinist apprentices mentioned above included raw scores on the 
Bennett Test of Mechanical Comprehension, The Minnesota Paper Form- 
board, and the ages of the apprentices. Correlations with scores on Form 
A of The Mechanical Adaptability Test are indicated in Table 2. Of 
special interest is the correlation of .71 between Form A of The Purdue
Mechanical Adaptability Test and the Bennett Test of Mechanical Com- 
prehension, since it purports to measure a similar trait or ability. 


Summary 


Forms R and S of The Purdue Mechanical Adaptability Test were administered
to 1,015 industrial applicants or employees already on the job
in eleven manufacturing plants. The “Kelley technique” as described
by Lawshe was used for estimating the validity of test items. A revision 
based on internal and external criteria resulted in Form A of the test for 
men or boys. The split-half method of determining the reliability was 
utilized and stepped up by the Spearman-Brown formula. Validity 
studies were made by using as criteria the job success of employees as 
estimated by supervisor’s ratings. The following results were obtained: 


1. Of the 100 items included in Forms R and S, the 60 items yielding 
an internal consistency D-value of .5 or better, and an external consist- 
ency D-value of .6 or less against The Adaptability Test were selected for 
Form A. Thus, items which tended to be associated with mental ability 
were eliminated; those that were highly consistent were retained. 

2. Validity studies of test scores and job success as estimated by 
supervisor’s ratings were made on employees in six different manufactur- 
ing plants. A study of the data showed that in general when the mini- 
mum acceptable score on The Mechanical Adaptability Test was increased, 
an increase in the per cent of people rated “high” could be expected, thus 
increasing the proportion of desirable employees on the job. 

3. An examination of mean scores of special college curricular groups 
revealed significant differences from group to group in the expected 
direction. 

4. The reliability of the revised test as obtained on a secondary group 
of 487 cases by the split-half method and stepped up by the Spearman-Brown 
formula was .84 ± .01. The corresponding value for a group 
of college men was found to be .80 ± .03. 

5. The item analysis procedures succeeded to a large degree in ac- 
complishing the objective of formulating a test that is relatively unre- 
lated to intelligence as measured by various standard intelligence tests. 
This is indicated by the following coefficients of correlation with various 
intelligence tests: The Adaptability Test, .17 ± .07 and .32 ± .04; Otis 
Self-Administering Higher Examination, .08 ± .20; The California 
Capacity Test: Language, .12 ± .16; The California Capacity Test: Non-language, 
.41 ± .13. 

6. In one sample, correlations of .71 ± .09 and .18 ± .16 with the 
Bennett Test of Mechanical Comprehension and the Minnesota Paper Formboard 
were obtained. 

7. The Purdue Mechanical Adaptability Test is useful in identifying 
men or boys who are mechanically inclined and are likely to succeed on 
jobs of a mechanical nature and it can be used in personnel situations as a 
supplement to regular employment procedures. 
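
The "stepping up" mentioned in item 4 can be sketched as follows. This is a minimal illustration of the Spearman-Brown formula in its general form; the half-test correlations it prints are implied by the reported reliabilities and are not figures given in the article, and the function names are hypothetical.

```python
def spearman_brown(r_half: float, k: float = 2.0) -> float:
    """Reliability of a test lengthened k times, given the part reliability."""
    return k * r_half / (1.0 + (k - 1.0) * r_half)


def half_test_r(r_full: float) -> float:
    """Invert the k = 2 case: the half-test correlation implied by r_full."""
    return r_full / (2.0 - r_full)


for r_full in (0.84, 0.80):  # reliabilities reported in item 4
    r_half = half_test_r(r_full)
    print(f"full = {r_full:.2f}, implied half-test r = {r_half:.3f}, "
          f"stepped up again = {spearman_brown(r_half):.2f}")
```

Working backward in this way shows how much of the reported reliability comes from the correction itself: a split-half correlation near .72 is what yields .84 for the full-length test.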


Received June 19, 1946. 














The Relative Readability of Newsprint and Book Print * 


Donald G. Paterson and Miles A. Tinker 
University of Minnesota 


In earlier studies¹,²,³ the authors have investigated the readability of 
newsprint and of book print, but no direct comparison has been made between 
the two kinds of printing. There are, however, various hints that 
newsprint may be read at a slower rate than book print. Paterson and 
Tinker³ found a consistent tendency for 6 and 8 point book type to be 
read more slowly than larger sizes of type. The most frequently used type 
sizes in newspaper printing are 7 and 8 point.² In another kind of study, Tinker⁴ 
discovered that in reading 7 point newsprint a greater intensity of light 
was needed for adequate perception than was necessary with 10 point 
book type.⁵ Nevertheless, since newsprint and book print represent 
somewhat different typographical situations, there is not enough evidence 
for an adequate statement of their relative readability. 

A direct comparison of the two kinds of printing is made in this study. 
Specifically, the purpose of the investigation is to compare the speed of 
reading commonly used newsprint and book print. 

In our survey of newspaper printing,⁶ the following was the most 
common practice for body types: Ionic type face was most frequently 
used, with Opticon the most popular of the newer type faces; 12 pica line 
width; 7 and 8 point type; and one point leading. In the same study we 
noted that one point leading improves readability of newsprint but that 
two point gives no added advantage. In view of these results and 
practices we chose the following newspaper typography for use in this 
study: Arrangement number one was 7 point Ionic No. 5 in a 12 pica line 


* Grateful acknowledgment is given to the Graduate School, University of Minne- 
sota, for research grant to finance this study. 

1 Tinker, M. A., and Paterson, D. G. Differences among newspaper body types in 
readability. Jour. Quart., 1943, 20, 152-155. 

2 Tinker, M. A., and Paterson, D. G. War time changes in newspaper printing 
practice. Jour. Quart., 1944, 21, 7-11. 

3 Paterson, D. G., and Tinker, M. A. How to make type readable. New York: 
Harper and Brothers, 1940, pp. 209. (Obtainable from the authors.) 

4 Tinker, M. A. Illumination intensities for reading newspaper type. J. educ. 
Psychol., 1943, 34, 247-250. 

5 Tinker, M. A. The effect of illumination intensities upon speed of perception and 
upon fatigue in reading. J. educ. Psychol., 1939, 30, 561-571. 

6 Tinker, M. A., and Paterson, D. G., op. cit., 1943. 

width with one point leading. Arrangement number two consisted of 
8 point Opticon in a 12 pica line width with one point leading. Both 
were printed on newsprint paper stock. Incidentally, Opticon was the 
most readable type face of nine investigated in another study.⁷ For the 


[Figure 1 sets the same passage in three arrangements: Cheltenham book type, 
10 point with two point leading; Opticon newsprint, 8 point with one point 
leading; and Ionic No. 5 newsprint, 7 point with one point leading. The 
passage begins as follows.] 

26. James’ fountain pen went dry when he was doing his homework for school. 
He was very cross because until he got some more glue he could not 
continue his work. 27. The boys saw coming towards them an old woman, 
bent with sorrow, dressed in deepest black. They thought, turning from 
their play to watch her pass, how happy she looked. 28. On 


Fig. 1. Samples of book type and newsprint type used in study of relative readability. 


book print we chose an optimum typographical arrangement. (See 
Paterson and Tinker, op. cit., 1943.) This consisted of Cheltenham type 
face, 10 point with two point leading in a 20 pica line width on eggshell 
paper stock. Samples of the printing used are shown in Figure 1. 


7 Tinker, M. A., and Paterson, D. G., op. cit., 1943. 

The reading material consisted of Forms A and B of the Chapman-Cook 
Speed of Reading Test. Although performance on Form B is 
equivalent to that on Form A on the average, this is not always true for 
small samples. A control group was introduced, therefore, to check on 
this equivalence. There were 30 paragraphs of 30 words each in each 
test form. The reading time allowed for each form was 1¾ minutes. 

Three groups of 90 college students each served as subjects. In 
Group I (control) the subjects read book print in Form A and Form B. 
In Group II, Form A was book print and Form B was the 8 point Opticon 
newsprint. And in Group III, Form A was book print, and Form B was 
the 7 point Ionic No. 5 newsprint. Apart from these comparisons, 
a further 117 college students ranked samples of the print according 
to apparent legibility and according to pleasingness. In this part of the 
experiment, samples of 150 words (five paragraphs of 30 words each) 
were mounted on cardboard and presented to the readers in a controlled 
manner. 


Results and Discussion 


Data for the speed of reading comparisons are given in Table 1. Results 
for the control group (Group I) show that a “correction” must be 
made by adding 1.59 paragraphs to the mean for Form B. Examination 
of the results for Groups II and III reveals that the 8 point Opticon news- 
print was read 0.92 of a paragraph more slowly than the book print, and 
that the 7 point Ionic newsprint was read more slowly than the book 
print by 1.01 paragraphs. These amount to a retardation in reading rate 
of 4.3 and 4.8 per cent respectively. The critical ratios in Column 10 of 
the table show that these differences are statistically significant. 
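
As a rough consistency check on these percentages, the sketch below assumes that each percentage is the corrected difference divided by the mean number of paragraphs read in book print. The Table 1 means are not reproduced in this excerpt, so the baseline is inferred rather than quoted, and `implied_baseline` is a hypothetical helper name.

```python
def implied_baseline(difference: float, percent: float) -> float:
    """Baseline mean implied by an absolute difference and its percentage.

    Assumption: percent = 100 * difference / baseline.
    """
    if percent <= 0:
        raise ValueError("percent must be positive")
    return difference * 100.0 / percent


# (label, paragraphs read more slowly, reported retardation in per cent)
reported = [
    ("8 point Opticon", 0.92, 4.3),
    ("7 point Ionic", 1.01, 4.8),
]

for label, diff, pct in reported:
    print(f"{label}: implied book-print mean = "
          f"{implied_baseline(diff, pct):.1f} paragraphs")
```

Both pairs of figures imply a book-print mean of roughly 21 of the 30 paragraphs in a form, so the two percentages are mutually consistent under this assumption.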

These results demonstrate that commonly used newsprint even when 
printed in an optimum arrangement is read much more slowly than book 
print set in an optimum typographical arrangement. 

The following factors probably operate to reduce the rate at which 
the newsprint was read: 1. The small size of newsprint type in comparison 
with the book type makes visual discrimination more difficult; 2. The 
lower brightness contrast between type and paper for the newsprint 
would adversely affect discrimination of the printed characters; and 3. 
Newspaper body types may not be as legible as book type faces. It is 
unlikely however that this third factor is important. 

Results derived from reader opinions of relative legibility are given in 
Table 2. The order of judgments is 10 point book type ranked first, 


8 Tinker, M. A., and Paterson, D. G. Studies of typographical factors influencing 
speed of reading. XIII. Methodological considerations. J. appl. Psychol., 1936, 20, 
132-145. 

Table 2 
Book Type and Newsprint Ranked According to 117 Reader Opinions of 
Relative Legibility 

Kind of Type          Average Rank    S.D.    Rank Order 
10 Point Book Type        1.65        .79         1 
8 Point Newsprint         1.68        .58         2 
7 Point Newsprint         2.68        .60         3 





followed by 8 point newsprint and then 7 point newsprint. Note, however, 
that there is actually very little difference in ranking 8 point newsprint 
and 10 point book print. As has been found before,⁹ judgments 
of legibility do not always agree with actual readability measurements. 
Readers’ opinions of pleasingness are listed in Table 3. The order 








Table 3 
Book Type and Newsprint Ranked According to 117 Reader Opinions of Pleasingness 

Kind of Type          Average Rank    S.D.    Rank Order 
10 Point Book Type        1.47        .66         1 
8 Point Newsprint         1.70        .54         2 
7 Point Newsprint         2.83        .46         3 





from most to least pleasing is 10 point book type, 8 point newsprint and 
7 point newsprint. Although there is some separation between mean 
ranks for the book type and the 8 point newsprint, the difference is not 
great. But the 7 point newsprint is considered definitely less pleasing 
than the others. As in the earlier report,¹⁰ pleasingness tends to agree 
with judged legibility. 


Summary and Conclusions 


1. The purpose of this investigation is to compare the readability of 
newsprint and book print. 

2. Speed of reading 10 point Cheltenham book type was compared 
with speed of reading 8 point Opticon newsprint and with 7 point Ionic 
No. 5 newsprint. 

3. Both kinds of newsprint were read significantly more slowly than 
the book print. 


9 Tinker, M. A., and Paterson, D. G. Reader preference and typography. J. appl. 
Psychol., 1942, 26, 38-40. 
10 Tinker, M. A., and Paterson, D. G., op. cit., 1942. 

4. The slower rate of reading newsprint is apparently due to the 
greater difficulty of discriminating the printed characters in comparison 
with the book type which is larger and which involves greater brightness 
contrast between print and paper. 

5. The 10 point book print and the 8 point newsprint are judged to be 
about equally legible, but the 7 point newsprint is considered to be far 
less legible. 

6. The book print is judged to be most pleasing, the 8 point newsprint 
next most pleasing, and the 7 point newsprint least pleasing. 


Received November 26, 1945. 











Age of Starting to Contribute versus Total Creative Output 


Harvey C. Lehman 
Ohio University 


In 1937, Professor E. T. Bell of the California Institute of Technology 
published this statement in his book, Men of mathematics: 

“In 1902 and 1904 the Swiss mathematical periodical, L’Enseignement 
Mathématique, undertook an enquiry into the working habits of mathema- 
ticians. Questionnaires were issued to a number of mathematicians, of whom 
over a hundred replied. . . . To the question ‘At what period . . . and under 
what circumstances did mathematics seize you?’ 93 replies to the first part 
were received: 35 said before the age of ten; 43 said eleven to fifteen; 11 said 
sixteen to eighteen; 3 said nineteen to twenty; and the lone laggard said 
twenty-six” (1, p. 547). 

The early interest in mathematics of those destined to become first- 
rank mathematicians has been asserted and commented on by numerous 
writers. This early interest is confirmed by the Swiss questionnaire 
study. 

Some may doubt that the 93 mathematicians who responded to the 
first part of the above question would interpret it in exactly the same way. 
Others may suspect that some of the respondees were unable to recall 
precisely when they first became interested in mathematics. However, 
the problem of early interest can be approached from another angle as is 
done in the present study. 

The present article presents information regarding the youngest 
chronological ages at which certain noted mathematicians made important 
contributions to their field. As here used, the words “important contributions” 
mean simply any contributions at all which are of sufficient 
merit to be cited and discussed by authorities in the history of mathe- 
matics. 

Since the present writer is neither a mathematician nor a journalist he 
is unable to explain the mathematical contributions listed below any more 
clearly than the authors whose descriptions are quoted. Hence, the 
complete reliance in what follows on verbatim quotations. 

If the reader finds this quoted material boresome, he can omit it 
without missing the main point of this discussion. These quotations 
serve a serious purpose, however, and some who may want to skip them 
at first will perhaps be motivated to read them after perusal of the main 
portion of this article. 

The technical mathematical terms need not be understood in order 
to follow the gist of the quotations. In so far as the present study is 
concerned, the significance of this creative mathematical work lies in 
the fact that it was accomplished by mere boys not one of whom was over 
21 years old at time of the indicated achievement. 


Niels Henrik Abel, (1802-1829). 

“Abel’s first ambitious venture was an attack on the general equation of 
the fifth degree (the ‘quintic’). All of his great predecessors in algebra had 
exhausted their efforts to produce a solution, without success. We can easily 
imagine Abel’s exultation when he mistakenly imagined he had succeeded. . . . 
The supposed solution was of course no solution at all. This failure gave him 
a most salutary jolt; it jarred him onto the right track and caused him to 
doubt whether an algebraic solution was possible. He proved the impossibility. 
At the time he was about nineteen . . .” (1, p. 309f.). 


Charles Babbage, (1792-1871). 
“Charles Babbage invented a machine,¹ called a ‘difference-engine,’ about 
1812. [Age 20]. Its construction was begun in 1822 and was continued for 
20 years. The British Government contributed £17,000 and Babbage himself 
£6,000” (2, p. 485). 


Charles Julien Brianchon, (1785-1864). 

“His paper on curved surfaces of the second degree was published in the 
Journal de l’Ecole Polytechnique, cahier 13, 1806. . . . This paper contains the 
famous theorem, known under the author’s name, which together with Pascal’s 
theorem [set forth at age 16] is at the very foundation of the projective theory 
of conic sections. . . . It is interesting to note that this article, which made 
the author’s name familiar to every student of geometry, was written by him 
at the age of 21, while he was still in school” (3, p. 331). 


Augustin-Louis Cauchy, (1789-1857). 
“In February, 1811, Cauchy submitted his first memoir on the theory of 
polyhedra.” [Age 21 years, 6 months] (1, p. 277). 


Arthur Cayley, (1821-1895). 
“His first work, published in 1841 when he was an undergraduate of twenty, 
grew out of his study of Lagrange and Laplace” (1, p. 381). 


Alexis Claude Clairaut, (1713-1765). 

“In 1731 was published his Recherches sur les courbes à double courbure, 
which he had ready for the press when he was sixteen. It was a work of 
remarkable elegance and secured his admission to the Academy of Sciences 
when still under legal age. In 1731 [age 18] he gave a proof of the theorem 
enunciated by I. Newton, that every cubic is a projection of one of five divergent 
parabolas” (2, p. 244). 


William Kingdon Clifford, (1848-1879). 

William Kingdon Clifford solved a problem in probability in 1866. [Age 18] 
(3, p. 540f.). 


Leonhard Euler, (1707-1783). 

“Euler’s first independent work was done at the age of nineteen” (1, p. 144). 


Joseph Fourier, (1768-1830). 
“In December, 1789 Fourier (then twenty-one) went to Paris to present 
his researches on the solution of numerical equations before the academy. 


1 Probably an early form of the mechanical calculator. 

This work advanced beyond Lagrange, and is still of value . . . it may be 
found in elementary texts on the theory of equations . . .” (1, p. 192). 


Evariste Galois, (1811-1832). 

“. . . Galois at the age of sixteen was already well started on his career 
of fundamental discovery . . .” (1, p. 366). 

“Galois at seventeen was making discoveries of epochal significance in the 
theory of equations, discoveries whose consequences are not yet exhausted 
after more than a century” (1, p. 368). 

“In Feb. 1830, at the age of nineteen . . . he composed three papers in 
which he broke new ground. These papers contain some of his great work on 
the theory of algebraic equations. It was far in advance of anything that had 
been done . . .” (1, p. 370). 

During the night preceding the duel over a love affair which ended his life 
at age 20, Galois . . . “spent the fleeting hours feverishly dashing off his 
scientific last will and testament, writing against time to glean a few of the 
great things in his teeming mind before the death which he foresaw could over- 
take him. . . . What he wrote in those desperate last hours before the dawn 
will keep generations of mathematicians busy for hundreds of years. He had 
found, once and for all, the true solution of a riddle which had tormented 
mathematicians for centuries: under what conditions can an equation be solved? 
But this was only one thing of many” (1, p. 375). 


Karl Friedrich Gauss, (1777-1855). 
[text illegible] his researches on pangeometry as early as 1792. [Age 15] 
([reference number illegible], p. 304). 
“He had already invented [at age 18] the method of ‘least squares,’ which 
today is indispensable in geodetic surveying, in the reduction of observations 
and indeed in all work where the ‘most probable’ value of anything that is 
measured is to be inferred from a large number of measurements” (1, p. 227). 
“When not quite nineteen years old Gauss began jotting down in a copy-book 
very brief Latin memoranda of his mathematical discoveries. . . . Of 
the 146 entries, the first is dated March 30, 1796, [Age 18 years, 11 months] 
and refers to his discovery of a method of inscribing in a circle a regular polygon 
of seventeen sides. . . . He worked quite independently of his teachers, and 
while a student at Göttingen [Age 18 to 21] made several of his greatest discoveries. 
. . . The great law of quadratic reciprocity, given in the fourth 
section of Gauss’ work, a law which involves the whole theory of quadratic 
residues, was discovered by him by induction before he was eighteen, and was 
proved by him one year later” (2, p. 435). 

“. . . the entry for March 19, 1797, shows that Gauss had already discovered 
the double periodicity of certain elliptic functions. He was then not 
quite twenty. Again, a later entry shows that Gauss had recognized the 
double periodicity in the general case. This discovery of itself, had he published 
it, would have made him famous. But he never published it” (1, 
p. 229f.). 

“At the age of twenty Gauss had overturned old theories and old methods 
in all branches of higher mathematics; but little pains did he take to publish 
his results, and thereby to establish his priority” (2, p. 434). 

“Why did Gauss hold back the great things he discovered? . . . [A] statement 
which Gauss once made to a friend explains both his diary and his 
slowness in publication. He declared that such an overwhelming horde of new 
ideas stormed his mind before he was twenty that he could hardly control them 
and had time to record but a small fraction” (1, p. 227f.). 


Wilhelm Jacob Storm van s’Gravesande, (1688-1742). 

“His was another case of the [illegible] of mathematical ability, his essay 
on perspective having attracted attention when he was only nineteen years 
old” (4, p. 526). 


Edmund Halley, (1656-1742). 


“. . . before he was twenty he communicated a paper to the Royal Society. 
So noteworthy had been his progress that in the very month in which 
he reached his twentieth birthday (November, 1676) he set out for St. Helena 
for the purpose of making astronomical observations. On the day before he 
was twenty-one he made the first complete observation of a transit of Mercury. 
So remarkable was his work at St. Helena that . . . the Royal Society elected 
him to a fellowship when he was only twenty-two” (4, p. 405). 


William Rowan Hamilton, (1805-1865). 

“. . . in his seventeenth year Hamilton had already begun his career of 
fundamental discovery. Before this he had brought himself to the attention 
of Dr. Brinkley, Professor of Astronomy at Dublin, by the detection of an error 
in Laplace’s attempted proof of the parallelogram of forces” (1, p. 343). 

“At the age of twenty-three he published the completion of the ‘curious 
discoveries’ he had made as a boy of seventeen, Part I of A theory of systems of 
rays, the great classic which does for optics what Lagrange’s Mécanique ana- 
lytique does for mechanics and which, in Hamilton’s own hands, was to be 
extended to dynamics, putting that fundamental science in what is perhaps its 
ultimate, perfect form” (1, p. 346). 


Charles Hermite, (1822-1901). 

“The Nouvelles Annales de Mathématiques, a journal devoted to the interests 
of students in the higher schools, was founded in 1842. The first volume 
contains two papers composed by Hermite while he was still a student at 
Louis-le-Grand. [Age about 20]. The first is a simple exercise in the analytic 
geometry of conic sections and betrays no originality. The second, which 
fills only six and a half pages in Hermite’s collected works, is a horse of quite a 
different color” (1, p. 450). 


Ernst Eduard Kummer, (1810-1893). 
“In his third year at the University Kummer solved a prize problem in 
mathematics and was awarded his Ph.D. degree at the age of twenty-one” 
(1, p. 512). 


Joseph Louis Lagrange, (1736-1813). 

“At the early age of nineteen he sent a solution of the isoperimetrical 
problem to Euler, in which he announced the principle of the calculus of 
variations. 


“This memoir inaugurated a new period in the calculus of variations and 
was esteemed very highly by Euler, who observed that the methods of Lagrange 
were more general than his own” (5, p. 240f.). 


Guillaume Francois Antoine de L’Hospital, (1661-1704). 

“When only fifteen he was one day at the Duc de Roanne’s and heard some 
mathematicians [speaking] of a difficult problem of Pascal’s. To their surprise 
he [announced] that he thought he could solve it, and in a few days succeeded” (4, 
p. [illegible]). 


Colin Maclaurin, (1698-1746). 

At the age of twenty-one he took to his London printer “his Geometria 
Organica, containing a new and remarkable mode of generating conics, known 
by his name . . .” (2, p. 228). 

James Clerk Maxwell, (1831-1879). 
“At the age of fifteen he published a paper on oval curves” (6, p. 251). 


Gaspard Monge, (1746-1818). 
At about age sixteen Gaspard Monge originated descriptive geometry. 
He “was at once given a minor teaching position to instruct the future military 
engineers in the new method. Monge was sworn not to divulge his method, 
and for fifteen years it was a jealously guarded military secret” (1, p. 185). 


Francois Nicole, (1683-1758). 
“He was a boy of unusual promise, having shown his genius in geometry 
by rectifying the cycloid at the age of nineteen” (4, p. 472). 


Blaise Pascal, (1632-1662). 

“Before the age of sixteen (about 1639)² he had proved one of the most 
beautiful theorems in the whole range of geometry” (1, p. 76). 

“Presently the family received a somewhat formal visit from Descartes. 
He and Pascal talked over many things, including the barometer. There was 
little love lost between the two. For one thing, Descartes had openly refused 
to believe the famous Essai pour les coniques had been written by a boy of 
sixteen” (1, p. 80). 

“At the age of nineteen he invented a computing machine that served as a 
starting point in the development of the mechanical calculation that has become 
so important in our time. That he should have been permitted to 
present one of these machines to the king and one to the royal chancellor shows 
the esteem in which he must have been held” (4, p. 382). 


Simeon Denis Poisson, (1781-1840). 
“At eighteen he wrote a memoir on finite differences which was printed on 
the recommendation of A. M. Legendre” (2, p. 466). 


Georg Friedrich Bernhard Riemann, (1828-1866). 

“According to Dedekind, ‘Riemann recognized in . . . partial differential 
equations the essential definition of an [analytic] function of a complex variable. 
Probably these ideas, of the highest importance for his future career, were 
worked out by him in the fall vacation of 1847 [Riemann was then twenty-one] 
for the first time’” (1, p. 489f.). 


James Joseph Sylvester, (1814-1897). 
“About the age of 16 he was awarded a prize of $500.00 for solving a question 
in arrangements for contractors of lotteries in the United States” (2, p. [illegible]). 


William Thomson (Lord Kelvin), (1824-1907). 

Rediscovered independently, in 1845, the principle of inversion called by 
Liouville the transformation by reciprocal radii. [Age 21] (2, p. 292). 

The foregoing quotations permit glimpses of the mathematical 
maturity that has been attained by some individuals before they were 22 
years old. These quoted statements should help readers to take seriously 
what follows. Standing alone, however, they are inadequate for our 
present purpose. As is indicated by its title, this study concerns itself 
with the relationship between age of starting to contribute creative work 
and total creative life output. We want to know, among other things, 
whether these youthful contributors fulfilled the promise of their early 
youth. The complete picture must include both the answer to this query 
and also information regarding persons who have started to make their 
intellectual contributions at each successive older age level. A bird’s- 


2 Authorities differ on Pascal’s age when this work was done, the estimates varying 
from fifteen to seventeen. 

eye view of our major findings with reference thereto can best be described 
by means of scattergrams constructed in a manner that will now 
be described. 

Figure 1 presents: (1) the chronological age at which each of 306 
deceased chemists made his first chemistry contribution of sufficient 
merit to be included in T. P. Hilditch’s A concise history of chemistry (7), 
versus (2) the total number of contributions by each chemist included 

Fig. 1. The age at which each of 306 chemists made his first important chemistry 
contribution versus the total number made by each. 

therein. For example, this figure reveals that Justus von Liebig, who is 
credited with 31 chemistry contributions in Hilditch’s history, made the 
first of those 31 contributions at age 16. Figure 1 shows also that the one 
person who made the largest number of contributions cited by Hilditch, 
Emil Fischer with 35 contributions, did his first important research in 
chemistry at age 23. 

Figure 1 reveals in general that the larger the total number of notable 
chemistry contributions made by a given individual, the younger the 
chronological age at which that individual started his research career. 
Thus, of the 10 most prolific contributors cited by Hilditch (see upper 
left-half of Fig. 1), only one made his initial contribution as late as age 27. 
Of the 29 largest contributors pictured in Figure 1, only 2 made their first 
important contribution as late as age 20. And of the 45 heaviest con- 
tributors, one only made his first contribution later than age 31. On the 
whole, it seems apparent that the later one starts to achieve in chemistry, 
beyond the early twenties, the smaller the probable total of one’s out- 
standing research contributions. 

Near the base-line of Figure 1, it may be seen that of 140 chemists 
each of whom made one contribution only that is described by Hilditch, 
one individual made his first and only notable chemistry contribution as 
late as age 75. It seems logical to infer that, although there is no dead- 
line beyond which it is impossible to make one’s initial contribution, age 
75 is too old to start contributing if one hopes to make more than one 
important contribution. 

Of 45 persons each of whom made two significant contributions to 
chemistry, only 1 of them did his first research as late as age 65, and 43 of 
the 45 did their first notable research prior to age 45. Similarly, of 31 
persons each of whom made 3 significant contributions, only 2 of them did 
their initial outstanding work as late as age 40. Finally, of 16 persons, 
each of whom made but 4 contributions, only 1 of them did his first 
important research work beyond age 35. 

Oliver Wendell Holmes, Jr., is said by several of his biographers to 
have held the belief or superstition that if genius is to be exhibited at all 
it must be displayed prior to age 40. It is said that Holmes once re- 
marked: “If you haven’t cut your name on the door of fame by the time 
you’ve reached 40, you might just as well put up your jackknife.” 

Figure 1 suggests that, in so far as creative chemistry is concerned, 
Holmes’s foregoing figure of speech should perhaps be revised to read as 
follows. If an individual is to make a chemistry contribution which 
ranks in importance with those cited in Hilditch’s history, the chances 
are 6 in 7 that the individual will have completed his first important research 
before he has passed age 40. This modification of Holmes’s 
statement seems justified in view of the finding that of the 306 chemists 
for whom age data are presented in Figure 1, not less than 83% did their 
first important chemistry research prior to age 40. 

The words “not less than 83%” were employed in the above because 
the time lag between date of accomplishment and date of announcement 
thereof is not always known. It seems likely that, if the full story were 
known, some of the other 17% may also have done their initial research 
at younger ages than the available record reveals. 

The data that are set forth graphically in Figure 2 were obtained from 
Cajori’s A history of mathematics (2). This figure presents: (1) the age at 
which each of 444 deceased mathematicians made his first contribution 
of sufficient merit to be mentioned and discussed in Cajori’s history, 
versus (2) the total number of contributions by each mathematician dis- 
cussed therein for whom age data were available. Mathematicians will 
be interested to know that the correlation ratio between these two vari- 
ables is —.61. 
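The correlation ratio mentioned above can be illustrated with a minimal sketch. It assumes the usual definition (square root of the between-class sum of squares divided by the total sum of squares), with starting age grouped into classes. The function name and the eight data points are hypothetical and are not taken from Cajori or from Figure 2. By this definition the ratio is never negative, so the minus sign in the text presumably marks the direction of the relation.

```python
from collections import defaultdict
from math import sqrt


def correlation_ratio(groups, values):
    """Return eta: sqrt(between-class sum of squares / total sum of squares)."""
    by_group = defaultdict(list)
    for g, v in zip(groups, values):
        by_group[g].append(v)
    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    ss_between = sum(
        len(vs) * (sum(vs) / len(vs) - grand_mean) ** 2
        for vs in by_group.values()
    )
    return sqrt(ss_between / ss_total) if ss_total else 0.0


# Hypothetical illustration only: starting-age class and total number of
# contributions for eight imaginary mathematicians.
start_age_class = ["15-19", "15-19", "20-24", "20-24",
                   "25-29", "25-29", "30-34", "30-34"]
total_contributions = [12, 10, 8, 6, 5, 3, 2, 2]

eta = correlation_ratio(start_age_class, total_contributions)
print(f"correlation ratio = {eta:.2f}")
```

A value near 1 would mean that the class of starting age accounts for most of the variation in total output; a value near 0 would mean that it accounts for almost none.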

Figure 2, like Figure 1, suggests that there is a lower age limit or 
threshold, prior to which important mathematical contributions are not 
likely to be made. This lower age limit occurs at a younger age level, 
however, than might have been anticipated by many, namely, somewhere 
in the late teens or early twenties, the exact lower limit depending upon 
the type, and perhaps even more upon the quality of the contribution 
that is under consideration. 

Figure 3 is based upon data obtained from 10 histories of physics. 
It reveals: (1) the age at which each of 388 deceased outstanding physi- 
cists made his first contribution deemed worthy of citation and discussion 
by one or more of the 10 historians, versus (2) the total number of con- 
tributions that was made by each of the 388 physicists. Comments al- 
ready made with reference to Figures 1 and 2 should enable readers to 
interpret Figure 3 without further aid. 

Table 1 presents statistical information regarding those who made 
many versus those who made few important contributions to chemistry. 
This table reveals that those who made larger numbers of notable chemis- 
try contributions did their first work at a younger average age (column 3) 
and their final work at an older average age (column 4) than did those who 
made fewer outstanding contributions. In other words, the more prolific 
contributors of outstanding research got “on the beam” earlier and they
stayed on longer. 

Tables similar to Table 1 were constructed for 20 different kinds of 
creative endeavor. Like Table 1, the latter all reveal that, as compared 
with the minor contributors, the major contributors to a given field ac- 
complished their first important research at younger average ages and 
their last important work at older average ages. 











Fig. 2. The age at which each of 444 mathematicians made his first important
contribution to mathematics versus the total number made by each.

Harvey C. Lehman
Age of Starting to Contribute versus Total Creative Output



Scattergrams, like Figure 1, were likewise constructed for 20 types of 
creative work. Space limitations preclude publication of the 20 scatter- 
grams and the data from which they were made. 

For 10 fields of work comparisons were made between the average
total life output of individuals who made their first contributions at ages
15 to 19 inclusive, versus the average total output of others who started
making their contributions to the same fields of endeavor from ages 20
to 24 inclusive. Both the mean and the median output of those groups
which started contributing at the younger age interval were 24% greater.

Fig. 3. The age at which each of 388 physicists made his first important
contribution to physics versus the total number made by each.

Table 1
More Prolific versus Less Prolific Contributors to Chemistry

            No. of          Average Age     Average Age     Average No. of
            Contri.         at Time of      at Time of      Years Between
No. of      Cited by        First           Last            First and Last
Men         Hilditch        Contri.         Contri.         Contri.

135         1 only          36.27           36.27            0
241         1-5             34.10           38.59            4.49
 38         6-10            27.95           48.05           20.10
 13         11-15           24.69           47.92           23.23
  4         16-20           25.25           50.75           25.50
  6         21 or more      22.17           60.17           38.00





Sufficient data were available in 20 fields of endeavor to permit trust- 
worthy comparisons between the average output of groups that started 
to contribute at ages 20 to 24, versus other groups that started contribut- 
ing from ages 25 to 29 inclusive. Both the mean and the median output 
were 19% greater for those groups which started contributing when 
younger. 

The fact that delay in starting to contribute at almost any given age 
level is still likely to make a difference in total average output may be 
seen by inspection of the scattergrams here presented. The approximate 
amounts of the decrements up to age 50 are given, by 5-year intervals, in 
the following tabulation. 


Mean and median percentages of decrease in average total life output by groups
which started their contributions at successive 5-year intervals. In each instance the
comparison is made with the groups which started to contribute during the preceding
5-year interval.

Age Intervals       20-24   25-29   30-34   35-39   40-44   45-49
No. of Groups         10      20      20      20      20      20
Mean decrease         24%     19%     17%     14%     10%     14%
Median decrease       24%     19%     24%     19.5%   12.5%   14.5%
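
The construction of the tabulation above can be sketched as follows. This is only an illustration under stated assumptions: the decrease is expressed as a percentage of the average output of the earlier-starting group, one such percentage is computed per field, and the mean and median are then taken across fields. The function names and the three sample fields are hypothetical.

```python
from statistics import mean, median


def percent_decrease(avg_earlier, avg_later):
    """Percentage by which the later starters' average falls below the earlier."""
    return 100.0 * (avg_earlier - avg_later) / avg_earlier


def summarize_fields(fields):
    """Mean and median, across fields, of the per-field percentage decrease.

    `fields` is a list of (outputs_of_earlier_starters, outputs_of_later_starters)
    pairs, one pair per field of endeavor.
    """
    decreases = [percent_decrease(mean(early), mean(late))
                 for early, late in fields]
    return mean(decreases), median(decreases)


# Hypothetical total-output figures for three imaginary fields.
fields = [
    ([10, 10], [8, 8]),    # average falls from 10 to 8
    ([20, 20], [15, 15]),  # average falls from 20 to 15
    ([8, 12], [7, 7]),     # average falls from 10 to 7
]
mean_decrease, median_decrease = summarize_fields(fields)
print(f"mean decrease {mean_decrease:.1f}%, median decrease {median_decrease:.1f}%")
```

If the percentage were instead taken on the later starters' average, the same data would read as the earlier starters being so many per cent "greater", which is the wording used in the running text.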





Table 2 lists the 16 English authors who have the largest number of 
works listed in Ryland’s Chronological outlines of English literature (8). 
This table reveals that for these 16 very prolific contributors, the mean 
age at time of making their first contribution was 23.3 years; the mean 
age at time of making their last contribution was 58.0 years; and the 
mean interval between the first and the last contributions was approxi- 
mately 35 years. 
One need not be an advanced student of English literature to realize
that it would be difficult and perhaps impossible to find any other list of 
16 English authors as distinguished as those whose names are found in 
Table 2. Those who believe that the renowned author attains his pre- 
eminence without effort should examine carefully the number of con- 
tributions made by each writer whose name appears in Table 2. 

Table 2
Sixteen Most Prolific Contributors to English Literature

                      No. of        Age at       Age at       No. of Years
                      Contri.       Time of      Time of      Between First
                      Cited by      First        Last         and Last
Individual            Ryland        Contri.      Contri.      Contri.

[illegible]           [illegible]     24           49            25
Sir Walter Scott      [illegible]     25           60            35
[illegible]           [illegible]     28           74            46
[illegible]           [illegible]     24           70            46
[illegible]           [illegible]     31        [illegible]      15
[illegible]           [illegible]     19        [illegible]      17
[illegible]           [illegible]     21        [illegible]      56
[illegible]           [illegible]     21        [illegible]      29
[illegible]           [illegible]     22        [illegible]      36
[illegible]           [illegible]     31        [illegible]      40
[illegible]           [illegible]     20        [illegible]      46
[illegible]           [illegible]     21        [illegible]      26
[illegible]           [illegible]     18        [illegible]      62
[illegible]           [illegible]     18        [illegible]      12
[illegible]           [illegible]     24        [illegible]      28
[illegible]           [illegible]     26        [illegible]      41

Mean                                  23.3      [illegible]      34.8

The names of 152 of Germany’s best-known, and probably most able,
literary men were obtained from a highly select bibliography (9) which
lists an average of only 3.0 works per author. The date on which each of
these authors published his first work was then found in Kosch (10) who
seems to have listed almost everything that these 152 authors published,
namely, an aggregate of 2,935 works, the average number of works per
author being 19.3. It is significant that 35% of the 152 distinguished
German authors started to publish at age 24 or younger, and that 11%
of them started to publish while still in their teens.

A second group of 473 less distinguished German writers, listed by
Kosch only, was next obtained by canvassing Kosch’s Lexikon from A to
Ca, inclusive. The average output for this less distinguished group was
only 3.4 works per author. This is less than 20% of the average output
of the more distinguished Germans.

The less distinguished group of German authors is also characterized
by a much smaller per cent of early starters. In contrast to the more
select group of 152 German writers listed by Priest, of whom 35% started
to publish at age 24 or younger, only 9% of the less distinguished German
authors started to publish when as young as age 24 or less. (See Table
3.)

Fig. 4. The age at which each of 152 highly distinguished German authors
published his first work versus the total number of works published by each.

Figure 4 presents: (1) the age at which each of the 152 more eminent
German writers published his first work, versus (2) the total number of
contributions made by each. Figure 5 sets forth similarly: (1) the age at
which each of 152 less distinguished German authors, listed by Kosch
only, published his first work, versus (2) the total number of publications
of each. In order to make Figures 4 and 5 more directly comparable,
data for only 152 of the 473 less distinguished Germans, taken alpha-
betically, are set forth in Figure 5.

Fig. 5. The age at which each of 152 less distinguished German authors
published his first work versus the total number of works published by each.

Table 3 makes possible a comparison of the average productiveness of
the youthful German starters who later were identified by Priest as
highly distinguished, versus the productiveness of other youthful starters
who failed to achieve such distinction. Notice in Table 3 that the 53
more distinguished German authors who started contributing at age 24
or younger produced a total of 1,598 works, the average being 30.2 works
per author! This high average is almost the same as that of the 16
prolific English authors listed in Table 2.

Table 3
Age Data Regarding More Distinguished versus Less Distinguished German Authors

                                           Starters at Age 24 or Less
                                                Per cent   Total Works   Per cent of
                                     No. of     of Main    Listed by     Main Group’s
Main Group                           Indiv.     Group      Kosch         Total Output

152 authors listed by both
[Priest and Kosch]                    [53]       35%        1,598           45%

473 less distinguished authors
listed by Kosch only                  [43]        9%          440           28%

The 43 other Germans who started to publish when equally young but 
who are not listed by Priest, published an average of slightly more than 10 
works each. Their average output is thus only about a third as great as 
is the average of the youthful starters who attained greater eminence. 
Their average output, nevertheless, is about three times as great as is the 
average of others in the less distinguished group who started to publish 
when beyond age 24. 
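
The averages quoted for the German authors can be checked with a few lines of arithmetic. The sketch below uses only figures stated in the text (2,935 works by 152 authors, 1,598 works by 53 early starters, 440 works by 43 early starters, and an average of 3.4 works in the less distinguished group). The variable names are illustrative.

```python
# Figures quoted in the text for the German authors.
total_works_distinguished = 2935   # works by the 152 authors listed by Priest
n_distinguished = 152
avg_less_distinguished = 3.4       # works per author, 473 authors in Kosch only

early_works_distinguished, n_early_distinguished = 1598, 53
early_works_less, n_early_less = 440, 43

avg_distinguished = total_works_distinguished / n_distinguished
avg_early_distinguished = early_works_distinguished / n_early_distinguished
avg_early_less = early_works_less / n_early_less

# Output of the less distinguished group as a fraction of the distinguished.
share_of_output = avg_less_distinguished / avg_distinguished
# Early starters compared across the two groups.
ratio_early = avg_early_less / avg_early_distinguished

print(f"{avg_distinguished:.1f} works per distinguished author")
print(f"{avg_early_distinguished:.1f} works per distinguished early starter")
print(f"{avg_early_less:.1f} works per less distinguished early starter")
print(f"less distinguished group: {share_of_output:.0%} of the distinguished average")
print(f"early starters, less vs. more distinguished: {ratio_early:.2f}")
```

These quotients agree with the statements that the less distinguished group produced less than 20% of the average output of the more distinguished, and that the less eminent early starters produced about a third as much as the more eminent early starters.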

By way of summary it may be said that: (1) as compared with the less 
distinguished, the more distinguished group of German authors was far 
more prolific, (2) a larger percentage of the more distinguished group 
started to publish during their late teens and early twenties, and (3) 
within each group the more youthful starters exhibited greatest pro- 
ductivity. 

Since similar results were found when numerous other such compari-
sons were made, it seems fair to conclude that, as compared with the
average individual, our most distinguished creative thinkers have usu-
ally possessed, among other things, an astonishing capacity for hard,
patient work.

For example, with reference to three of the mathematicians, whose 
initial work is described by use of quotations in the forepart of this article, 
one authority in the history of mathematics remarks: “For prolific
inventiveness Euler, Cauchy, and Cayley are in a class by themselves
. . .” (1, p. 378). More specifically with regard to the industry of
Leonhard Euler the same historian asserts: 

“The extent of Euler’s work was not accurately known even in 1936, but
it has been estimated that sixty to eighty quarto volumes will be required for
the publication of his collected works” (1, p. 139).


Of Cauchy he presents the following details:


“During the last nineteen years of his life he produced over 500 papers on
all branches of mathematics, including mechanics, physics, and astronomy.
Many of these works were long treatises” (1, p. 289). “His total output is
789 papers (many of them very extensive works) filling twenty-four large quarto
volumes” (1, p. 292).


With reference to Cayley we find: 


“. . . his massive Collected Mathematical Papers (thirteen large quarto 
volumes of about 600 pages each, comprising 966 papers) will suggest profitable 
forays to adventurous mathematicians for generations to come” (1, p. 402).

After reading the above statements, one can hardly refrain from
wondering whether Euler, Cauchy, Cayley, and the other prodigious 
workers, of whose industry fleeting glimpses may be obtained herein, 
were not criticized by some of their contemporaries for overproduction. 

A list of notable oil paintings was obtained by making a composite 
study of 60 different books which contain lists of so-called master paint- 
ings. This procedure utilizes the collective judgments of art critics and 
historians who have published evaluations under their own signatures 
and who must, therefore, have tried conscientiously to make sound 
evaluations. 

In what follows it is assumed: (1) that this large number of independ- 
ent critics have exhibited no constant prejudice for or against any one 
particular age group, and (2) that careful study of the frequency with 
which paintings have been listed by the various compilers should enable 
one to identify the really great paintings. 

The next step in the analysis of the data was to omit all paintings 
which were found to be listed only once or twice in the 60 compilations 
on the theory that this procedure would tend to eliminate eccentric judg- 
ments. That is to say, if a given painting were judged to be a great 
artistic work by only 1 or 2 of the 60 compilers, it was assumed that that
particular painting would be less likely to possess genuine merit than
would another painting which had been chosen by 3 or more of the 60
compilers. Those paintings which were chosen by 3 or more compilers
will be referred to hereinafter as “superior” paintings.
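
The selection rule just described amounts to a frequency count across compilations with a threshold of 3. The following is a minimal sketch of that rule. The function name and the placeholder titles are hypothetical, and it is an assumption here that a painting named twice in the same book is counted once for that book.

```python
from collections import Counter


def superior_items(compilations, min_listings=3):
    """Return the titles listed in at least `min_listings` compilations."""
    counts = Counter()
    for titles in compilations:
        # set() ensures a title is counted at most once per compilation.
        counts.update(set(titles))
    return {title for title, n in counts.items() if n >= min_listings}


# Hypothetical placeholder data: four compilations instead of sixty.
compilations = [
    ["Painting A", "Painting B"],
    ["Painting A", "Painting C"],
    ["Painting A", "Painting B", "Painting B"],
    ["Painting B"],
]
superior = superior_items(compilations)
print(sorted(superior))
```

Raising `min_listings` makes the list more select at the cost of a smaller sample, which is the trade-off implied by the choice of 3 listings out of 60 books.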

When the data for artists who produced superior paintings were 
partitioned in various ways and analyzed, it was found that the more 
recently-born artists had achieved an average of only 2.93 superior paint- 
ings per individual, whereas, the earlier-born artists had achieved an 
average of 4.55 superior paintings. 

Further scrutiny revealed also that not less than 35% of the earlier- 
born, and more distinguished, artists did their first oil painting (not 
necessarily a superior one) at age 24 or younger, and that not less than 
18% of them did their first painting at age 19 or less. (See Figure 6.) 
In contrast to the more distinguished group, the more recently-born 
artists, who seem to have achieved a somewhat less enviable record than 
the earlier-born, are characterized also by a smaller percentage of very 
youthful starters, only 15% of the more recently-born group having 
started their painting career at age 24 or younger. (See Table 4.) 

Table 5, which presents data for composers of grand opera, was con- 
structed in a manner similar to Table 4. Table 5 reveals once again that 
the more distinguished of two groups of creative workers includes a 
larger percentage of talented youthful starters. 

Fig. 6. The age at which each of 80 very distinguished artists did his first
painting in oil versus the total number of superior pictures produced by each.











Table 4
Age Data Regarding More Distinguished versus Less Distinguished Painters in Oil

                                           Starters at Age 24 or Less
                                                Per cent   Total No.    Per cent of
                                     No. of     of Main    Superior     Main Group’s
Main Group                           Artists    Group      Paintings    Total Output

80 artists born prior to 1630
who achieved an average of
4.55 superior paintings                29        36%         214           57%

80 artists born from 1630 to
1850 who achieved an average
of only 2.93 superior paintings        12        15%          47           20%

Table 5
Age Data Regarding More Distinguished versus Less Distinguished
Composers of Grand Opera

                                           Starters at Age 24 or Less
                                     Per cent     Total        Per cent of
                                     of Main      No. of       Main Group’s
Main Group                           Group        Operas       Total Output

88 men who composed 191
superior grand operas each
of which appeared 3 or more
times on a composite list              36%          80*           42%*

559 other men who produced
1,723 operas of lesser
[illegible]                            15%         451†           26%†

* Superior operas only.
† Operas listed by Pratt, W. S.: The new encyclopedia of music and musicians. New
York: The Macmillan Co., 1924. Pp. vi + 967.


When scattergrams for quite different kinds of creative achievement 
are based upon works of only very superior quality, they exhibit much 
similarity. For example, Figure 7 presents: (1) the ages at which each 
of 207 noted French and German philosophers published his first work 
(not necessarily a very superior one), versus (2) the total number of 
superior philosophical works by each, i.e., works which were cited and
discussed in 3 or more of 50 histories of philosophy. 

Although the data for these different types of creative work are 
sufficiently alike to support our main conclusions, one cannot judge 
accurately, by mere inspection of our tables and graphs, which kind of 
creative work comes earliest or which can be continued longest. This 
is because the several kinds of creativity cited herein have not been 
equated as regards their quality or merit. For example, the first and the 
last chemistry contributions may be somewhat more select (or somewhat 
less select) than are, let us say, the first and the last philosophical
contributions. It is true also that information regarding first and last
contributions may not be equally available in every field. Unequal 
amounts of time lag between date of achieving and date of announcement 
thereof are also a possibility. For all of these reasons, the data herein 
should be regarded as approximate only, not as mathematically exact or 
directly comparable. 

As a supplement, the following literary exposition is perhaps à propos
since the concluding statement is so abundantly validated by our
statistics.

Fig. 7. The age at which each of 207 French and German philosophers published his
first work versus the total number of superior works produced by each.


“When Lindbergh flew to France—at just 25—every newspaper had to
dwell upon his youth. He was a mere kid. Yet he was as old as Keats was
at death. He was a year older than Pitt was when he became prime minister
of England. He was eight years older than Mendelssohn was when he
composed his overture to A midsummer night’s dream. John Ericsson, who did
many things besides build the Monitor, was a draftsman at 12 and a full-fledged
engineer at 15. Chatterton finished at 18; Galois, the mathematician, at 20.
Jane Austen was writing one of her best novels [Pride and prejudice] at 21. . . .

“Anyone can leaf through a dictionary of biography and make similar lists
in a half hour. In other words, much of the significant record of the human
race has been made by men and women scarcely older than the hundreds of
thousands of students who mull along in crowd fashion, year after year, in our
undergraduate colleges” (11, p. 16).


This discussion will end with the observation that all of the renowned 
benefactors of humanity in the following list were less than 25 years old 
(Lindbergh’s age when he flew to France) at the time they did their first
important creative work. 


56 of the chemists as shown in Figure 1.
53 mathematicians (see Figure 2).
41 physicists (see Figure 3).
53 distinguished German authors (see Figure 4 and Table 3).
11 (or 69%) of the 16 English authors listed in Table 2. (For 520 less
    distinguished English authors listed by Ryland, who produced an
    average of only 3.98 works per author, the corresponding figure is
    15%.)
29 noted painters in oil (see Figure 6 and Table 4).
30 eminent French and German philosophers (see Figure 7).
32 composers of superior grand operas (see Table 5).


There is reason to suspect that the above sample findings, not one of 
which is exhaustive, can be duplicated easily in almost every creative 
field. On the whole, our findings suggest that those destined to go far 
have started early and moved rapidly. 

Although we know that the earlier starters produced on the average
both more and better creative work than did those who started to
contribute later in life, it would be even more valuable if we could ascertain
whether the earlier start tends to make one’s best creative work even
better than it otherwise would be. It may be argued, with some justice
perhaps, that it was the most brilliant individuals who started making
their contributions earliest, and that both the quantity and the quality
of their output were due solely, or chiefly, to their unequaled brilliance
rather than to their early start. Since the phenomena here considered
cannot be subjected to rigorous controlled experimentation, one can only
speculate with reference to this latter possibility.


Received October 19, 1945. 


The Use of the Harrower-Erickson Multiple Choice Rorschach
Test with a Selected Group of Women in 
Military Service 


Capt. Marjorie Case Winfield, USMCR * 


When the plan to send members of the Marine Corps Women’s Reserve
overseas was authorized, a staging area was established where all
personnel were sent for a period of about two weeks immediately prior to being
shipped overseas, for the purpose of medical and dental examinations,
outfitting and other processing, a minimum amount of re-training, and
final screening.

Since the tour of duty was to be a minimum of two years, with no 
stateside furlough or leave during that time, and no guarantee of
immediate return upon completion of the two years, it was important to
send women Marines who could stand the monotony as well as the ex- 
citement, the limitations and frustrations as well as the adventures of 
such an assignment. Consequently, the qualifications for duty were 
set high. Anyone desiring an overseas assignment had to volunteer for 
it, and to be eligible a woman Marine had to have had a minimum of six 
months active duty in the Marine Corps, exclusive of recruit or specialist 
training. She must have had a good conduct record, a good health re- 
cord with no misconduct status, no courts martial, and a good work 
record. She must have been recommended by her Commanding Officer 
and she must have “demonstrated in her military service a sense of 
responsibility, maturity, adaptability and emotional stability.” 

These requirements automatically excluded the obviously misfitted, 
those with records of previous maladjustments of one sort or another. 
However, there was still the possibility that certain members of this 
selected group, under strange and more complicated circumstances, 
would become problem cases. To aid in the detection of potential mis- 
fits two tests, the Harrower-Erickson Multiple Choice Test (for use with 
Rorschach Cards or Slides)¹ and the Minnesota Multiphasic Personality
Inventory (MMPI), were given to the first group of 181 enlisted women
the day after their arrival in the staging area. 


*This paper is not to be construed as the official opinion or conclusions of the 
USMC. 

¹ Henceforth in this paper, in order to avoid confusion with the Harrower-Erickson
Group Rorschach Test, this test will be referred to as the Multiple Choice Rorschach
or abbreviated MCR.

The sole purpose in giving these two tests to the group was to give the
officers charged with the responsibility of the final selection some con- 
crete indication of those individuals least likely to maintain a satisfactory 
adjustment. It was intended that the women so indicated by the tests 
would be closely observed by the staging area staff during the two week 
staging period which, of necessity, was one of strain and tension. Neither 
test was intended to be used for differential diagnosis nor were they to be 
considered either separately or combined as sufficient reason to exclude a 
woman from going overseas, unless substantiated by other evidence 
obtained through careful observation. 


Administration and Scoring 


The MCR was given to groups of 35 enlisted women at a time.
Instructions for marking the blanks were given as they are written on the
cover page of the test form. Each Rorschach slide was thrown on the
screen for a period of 1½ minutes with the lights off, and for an equal
amount of time with the lights on, making a total of 3 minutes for each
slide.

The MMPI was given after the MCR usually with about a half hour 
interim, with the instructions given as they are on the cover page of the 
booklet.² Scoring was done by machine.

Prior to establishing the testing program the scoring method of the 
MCR was discussed with the author of the test whose opinion it was that 
the circumstances might warrant penalizing the individuals who checked 
“Nothing At All” or who made no markings whatsoever in one of the
A, B, or C sections, both of these answers indicating a failure. As a con- 
sequence of this discussion, two methods of scoring were used on all 
blanks: 

(a) Each poor answer, as indicated in the scoring instructions for the test,³
    was scored simply as one poor answer. Each “failure” (i.e. “Nothing
    At All” checked or no mark made within an A, B, or C section of the
    blank) was scored also as one poor answer.

(b) Each poor answer was counted in the same manner as above, but each
    “failure” (“Nothing At All” checked, or no check) was counted as two
    poor answers.


In both methods the total number of answers each individual had 
checked was computed and the percentage of her poor answers was 
considered to be her final score. 


² S. R. Hathaway and J. C. McKinley. Booklet for the Minnesota Multiphasic
Personality Inventory. Minneapolis: University of Minnesota Press. 1943.

³ Harrower-Erickson and Steiner. Large Scale Rorschach Techniques. Springfield,
Illinois: Charles C. Thomas. 1944.

The cutting point for both methods of scoring was established at 40%.
According to the author of the test, individuals earning a score of less
than 40% were to be considered “normal”; those earning scores between
40-59% were to be considered doubtful or questionable risks (the higher
the percentage, the less stable the individual); those individuals with
scores between 60% and 100% were to be considered highly doubtful.
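
The two scoring methods and the cutting points can be summarized in a short sketch. It is an illustration under stated assumptions, not the published scoring key: the function names are hypothetical, the denominator is taken to be the total number of answers checked, and poor answers are counted separately from failures. Under these assumptions a Method II score may exceed 100%, since each failure adds two poor answers to the numerator.

```python
def mcr_score(n_answers, n_poor, n_failures, failure_weight=1):
    """Percentage of poor answers for one test blank.

    n_answers      -- total number of answers the subject checked
                      (assumed to be the denominator of the percentage)
    n_poor         -- poor answers other than failures
    n_failures     -- sections marked "Nothing At All" or left unmarked
    failure_weight -- 1 for Method I, 2 for Method II
    """
    if n_answers <= 0:
        raise ValueError("n_answers must be positive")
    return 100.0 * (n_poor + failure_weight * n_failures) / n_answers


def mcr_category(score, cutting_point=40.0):
    """Classify a score using the 40% cutting point and the 60% boundary."""
    if score < cutting_point:
        return "normal"
    if score < 60.0:
        return "doubtful"
    return "highly doubtful"


# Hypothetical blank: 30 answers checked, 9 poor answers, 2 failures.
method_1 = mcr_score(30, 9, 2, failure_weight=1)
method_2 = mcr_score(30, 9, 2, failure_weight=2)
print(f"Method I:  {method_1:.1f}% ({mcr_category(method_1)})")
print(f"Method II: {method_2:.1f}% ({mcr_category(method_2)})")
```

In this hypothetical case the heavier penalty for failures moves the subject from below to above the cutting point, which is the kind of shift examined in the tables that follow.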


Discussion of Findings 


The original plan for final screening called for all women arriving at 
the staging area to be given both MCR and MMPI. However, after 
testing the first group of 181 women, it was decided to omit the MCR as 
it was the opinion of the staff that the test results appeared to be too 
ambiguous to be used for predicting future behavior with any degree of 
reliability. 

The results of the two methods of scoring are shown in Table 1. 


Table 1 


Distribution of Scores Made in the Two Methods of Scoring (Method I, Each Failure 
Counted as One Poor Answer; Method II, Each Failure 
Counted as Two Poor Answers) 
Multiple Choice Rorschach 
Table 2
The Median, Semi-interquartile Range, Mean and Standard Deviation for the
Two Methods of Scoring

                N       Md        Q        M        SD
Method I       181     19.60     8.54     24.20    17.35
Method II      181     22.02     9.61     26.30    19.20

Table 3
Comparison of Scores Made Using the Two Methods of Scoring (Method I, Each Failure
Counted as One Poor Answer; Method II, Each Failure
Counted as Two Poor Answers)

                                                      Method I     Method II
Total no. above cutting point*                        26 (14%)     31 (17%)
No. scores within range 60-104 (highly doubtful)       9 (5%)      14 (8%)
No. scores 40-59 (doubtful)                           17 (9%)      18 (10%)
[No. scores below cutting point]                     155 (86%)    150 (83%)

* Cutting point = 40.

Table 4
Number of Scores Raised When the Failures Were Counted as Two Poor Answers

                                                      No.     Per Cent
[Number of scores raised]                             65*        36
Number of scores not raised                          116         64
Number raised within any given group                  55         30
Number raised from one range to another                9          5
Number of scores raised within:
    [range label illegible]                           46         25
    [range label illegible]                            6          3
    [range label illegible]                            3          2
Number of scores raised from range:
    [range label illegible]                            5          3
    [range label illegible]                            4
Number of scores raised two ranges                     0

* 1 score was raised to above 100%.


In the cases where the MCR scores were raised from below to above 
the cutting point as shown in Table 5, only one woman made significant 
deviate scores on the MMPI. Both her hypomania and schizophrenia 
scales have critical scores. However, her F or validity score is not only 
a critical one but sufficiently high to consider the possibility of the in- 
dividual’s having purposely attempted to make a bad set of scores, 
particularly since the scores made on the depression and hysteria scales
are the only ones not closely approaching the cutting score. 

As may be seen from Table 1, when each failure was counted as 1
poor answer, there were 26 women with scores above the cutting point,
9 of whom were “highly questionable,” that is, having scores between
60% and 100%. The second method of scoring produced 31 women with
scores above the cutting point, 13 of whom were “highly questionable.”
Considering only those scores at the very extreme in the 1st method of
scoring, those between 60% and 100%, even in a random sample, which
this is not, it would be surprising to find that 5%, or 9 individuals out of a
sample of 181 earned such extremely significant scores unless they were
deliberately trying to do so. But in a highly selected group of women,
where the motivation and incentive to do well on the test reduces to a
minimum the chances of their deliberately trying to earn poor scores,
these extreme scores become even more surprising and somewhat less
understandable.


Table 5
Multiple Choice Rorschach Scores Raised Above the Cutting Point by the Second
Method of Scoring and the Scores Made by the Same Five Individuals
on the Minnesota Multiphasic Personality Inventory

Multiple Choice Rorschach*        Minnesota Multiphasic Personality Inventory
% Poor Answers
Method I    Method II             F Dem uN Mm.

37% 47% 51 49 56 52
43 47 & 5i 39
47 40 47 42 39
40 49 54 40 40
44 56 52 68 75

* Cutting point = 40.


By similar reasoning we would ordinarily expect individuals with such 
highly significant scores as 80% to 100% to produce at least significant 
deviate scores on other tests for detecting personality disturbances. 
However, as will be seen in Table 6, those individuals with significant 
MCR scores did not have any significant scores on the 8 major scales of 
the MMPI; nor did the author or other members of the staff observe any 
behavior symptomatic of poor adjustment or instability. It might mean, 
of course, that the MCR is a supersensitive measuring instrument, 
capable of detecting potential behavior deviations beyond other tests and 
beyond the elements of selectivity used in this group, except that if this 
were the case and we started our reasoning with the ten individuals who 
made significant scores on the MMPI, then we should expect to have the 
“more sensitive instrument” showing deviations for these people also. 
However, inspection of Table 7 does not substantiate such an expectation. 
Not once, by the first method of scoring, did the “more sensitive 
instrument” uncover potential maladjustment in the individuals so 
indicated by the MMPI. On the other hand, by the second method of 
scoring the only individual who earned a significant deviate major scale 
score is the woman with the F score of 80, which probably indicates 
invalidity. 


In the ten cases of deviation on the MMPI, as shown in Table 7, it is 


Table 6 


Significant Multiple Choice Rorschach Scores and the Scores the Same Individuals 
Made on the Minnesota Multiphasic Personality Inventory 








Multiple Choice Rorschach*     Minnesota Multiphasic Personality Inventory†
Method I  Method II            L  F  Hs  D  Hy  Pd  Pa  Pt  Sc  Ma 
95% 95% 70 8 4 47 5S 51 5 34 37 48 
88 88 6 5 39 4 5 47 49 37 42 66 
87 103 cS ee ae ee re 50 37 43 £57 
81 89 628 50 4 5 61 4 50 39 #40 5 
80 80 5 5 39 3% 31 #47 5S & 52 68 
74 74 6 50 4 38 S57 49 5&8 39 42 & 
63 66 5 8 41 42 438 37 38 39 #438 «57 
63 77 5S 55 39 6 45 47 5 8 4 39 
62 62 5 8 37 4 40 40 8 43 44 62 
58 87 5 58 37 40 48 49 4 3 44 = 59 
53 53 56 464«80 COC CHG CC ia Ch 
53 60 60 50 37 38 49 58 47 41 40 66 
§1 53 5B 6 5 51 499 499 6 5 60 68 
50 57 565 50 39 4 & 5 41 3 39 48 
50 50 5 55 39 5 SO 5 5 37 4 & 
50 50 6 6«86Cti<‘ HK CCC ae tiCOC«é‘é*GDVN 
50 50 565 5 39 #499 58 4 6 36 39 #43 
49 51 50 50 39 38 5&0 40 41 37 +39 63 
46 53 70 64 39 49 #4 42 #35 41 40 & 
46 46 6 5 39 4 8 4 47 3 4 £457 
44 70 5 50 39 47 8S 47 5 3 40 50 
43 48 80 5 41 47 SH 4 56 36 42 43 
42 42 53 50 41 3 40 30 41 41 42 & 
41 66 60 58 37 47 42 S51 44 42 49 48 
40 40 68 50 37 5 4 49 #47 #3 49 5&O 
40 43 53 55 37 5 5 42 580 47 49 387 





* All scores on Multiple Choice Rorschach of 40 and above considered significant. 
† All scores on Minnesota Multiphasic Personality Inventory of 70 and above considered significant. 
interesting to note that eight out of the ten deviations are on the Ma 
(hypomania) scale. An Ma score of 70 to 75 is not ordinarily interpreted 
as being necessarily undesirable or maladjustive, provided the rest of the 
profile is good. And considering the unusual set of motivation factors in 
this situation, it is not at all surprising, or even “deviational,” to find 
significant scores made on the scale which indicates persons who are 
overproductive in thought and action, who are full of vigor, ambition, plans 
and enthusiasm. Before volunteering they were all told that life overseas 
would be more rugged, harder, less convenient, less comfortable than in 
the States; that there was a great deal of work to be done and that 
probably they would all be required to put in many extra hours' work. In 
other words, it was put up to them as a challenge; and therefore it 
probably would be surprising only if there were no Ma scores of 70 or over. 


Table 7 


Significant Minnesota Multiphasic Personality Inventory Scores and the Scores the 
Same Individuals Made on the Multiple Choice Rorschach Test 








Minnesota Multiphasic Personality Inventory* . Multiple Choice 


Rorschach 
He: DD, >-Hy Pd Pa. &8c...Ma Method I Method II 


45 
41 
62 
40 
43 
45 
39 
41 
39 
39 





47 4 8% 438 38 72 33% 33% 
49 56 44 438 49 77 30 30 
52 68 67 68 75 72 28 44 
52 56 73 38 54 63 27 30 
56 47 57 52 47 72 24 24 
49 61 41 & 53 8&4 19 19 
aczweeSE @® @® 7 17 17 
56 6«63:CO44CSOsiCiHHGCesCS77200” 17 17 
36 51 47 #49 #45 = 75 16 16 
42 42 3 38 44 72 10 10 


L 
56 
50 
50 
50 
50 
50 
53 
50 
50 
53 


Elegsessssssss|= 
* Standard scores of 70 and above are considered significant. 


Conclusion 


Since there was no correspondence between the scores made on the 
Multiple Choice Rorschach (MCR) and the Minnesota Multiphasic 
Personality Inventory (MMPI), nor any observed behavior which war- 
ranted a diagnosis of maladjustment such as the extreme scores made on 
this test would indicate, it must be concluded that the MCR differentiates 
something other than it purports to do and that further research and 
standardization are necessary before the test can be used on a similarly 
selected sample for the screening of maladjusted individuals. 


Received October 8, 1945. 








The Relationship Between Subjective Estimates of Personal 
Adjustment and Ratings on the Bell 
Adjustment Inventory 


Jacob Tuckman 
Jewish Vocational Service, Montreal, P. Q. 


The recognition that personal maladjustment is a contributing factor 
to job maladjustment has pointed up the need for the vocational counselor 
to gather information regarding the ability of the counselee to adjust to 
others, and the extent to which he may be burdened with personal prob- 
lems. Such information is sometimes obtainable from the school, par- 
ents, social agencies which may have had contact with the counselee or 
his family, as well as from observation during counseling interviews. In 
addition, however, it has been found advisable to include in the test 
battery, consisting of tests of intelligence, achievement, interest, and 
specific aptitudes, some measure of adjustment. Although many tests 
under the broad heading of personality, adjustment, and character have 
been developed, validity has been generally too low to warrant their use 
for individual diagnosis. These tests are too often dependent upon the 
honesty and insight of the counselee. Nevertheless, they have been 
found useful in providing clues to an individual’s adjustment, and in 
serving as points of reference in the interview. 

One widely used test to determine an individual’s adjustment is the 
Bell Adjustment Inventory, designed for use with students and adults. 
The student form, consisting of 140 questions to be answered by “Yes,” 
“No,” or “?,” gives scores in four adjustment areas: Home, Health, 
Social, and Emotional, as well as the total adjustment based on the sum 
of the four separate scores. The adult form, consisting of 160 questions, 
yields an “Occupational” adjustment score in addition to the other four 
measures, but is not scored for this adjustment area if the individual is 
not or has not been employed. Either inventory can ordinarily be com- 
pleted in twenty to thirty minutes. 

At one point in the testing program at the Cleveland Jewish Voca- 
tional Service, when time limitations presented a serious problem, we 
became interested in determining whether a shorter, more direct approach 
to the measurement of a counselee’s adjustment derived from the Bell 
Adjustment Inventory could be devised which would be a satisfactory 
substitute for that test. A set of five statements covering the same areas 
and degrees of adjustment as are included in the Bell Adjustment Inventory 
was drawn up, on which the counselee was to rate his adjustment 
on a five-point scale. The shortened scale, hereinafter referred to as the 
Adjustment Questionnaire, is as follows: 


ADJUSTMENT QUESTIONNAIRE 

Directions: 
Under the following headings, underline the answer that you feel applies to you. Compare yourself with 
other people you know in choosing your answer. There are no right or wrong answers. 

1. MY HOME LIFE IS: 
EXCELLENT GOOD AVERAGE UNSATISFACTORY VERY UNSATISFACTORY 

2. MY HEALTH IS: 
EXCELLENT GOOD AVERAGE UNSATISFACTORY VERY UNSATISFACTORY 

3. MY ABILITY TO GET ALONG WITH OTHER PEOPLE IS: 
EXCELLENT GOOD AVERAGE UNSATISFACTORY VERY UNSATISFACTORY 

4. MY HAPPINESS IN LIFE IS: 
EXCELLENT GOOD AVERAGE UNSATISFACTORY VERY UNSATISFACTORY 

5. MY SOCIAL CONTACTS ARE: 
EXCELLENT GOOD AVERAGE UNSATISFACTORY VERY UNSATISFACTORY 


The statements on Home and Health adjustment presented no difficul- 
ties, but it was felt that those on Social and Emotional adjustment 
needed different phrasing. Statements three and five covering Social 
adjustment seemed more appropriate than the use of either statement 
alone. Statement four seemed to cover the Emotional area adequately. 

The subjects for the study were 191 high school boys, 200 high school 
girls, 51 men, and 45 women, who were referred for testing by the counseling 
and placement departments of the Cleveland Jewish Vocational 
Service. The school group was enrolled in a college preparatory course 
in various high schools, or was enrolled in junior high schools normally 
leading to such a course of study. The group was about equally dis- 
tributed in grades nine to twelve. Both boys and girls were superior in 
intelligence as measured by the American Council on Education Psycho- 
logical Examination for High School Students (1941 and 1942 editions). 
The adult group consisted of unemployed men and women who were in 
need of job counseling. The majority were high school graduates. The 
men were superior in intelligence, the women average, as measured by the 
Pressey Senior Classification Test. The Adjustment Questionnaire was 
given before the Bell Adjustment Inventory to half of the subjects and 
after the Bell to the other half. 

For each of the four groups, contingency coefficients for the ratings on 
the Bell Adjustment Inventory and the Adjustment Questionnaire were 
computed and corrected for broad groupings. These are presented in 
Table 1. The corrected contingency coefficients, with the exception of 
the Emotional adjustment for boys, are surprisingly high. There is little 
sex difference. The contingency coefficients for the adult groups are 
higher than the school groups in all areas. 
Table 1 

Contingency Coefficients for Ratings on the Bell Adjustment Inventory and the 
Adjustment Questionnaire for Boys, Girls, Men and Women 

                  Boys             Girls             Men             Women 
                N = 191          N = 200           N = 51           N = 45 
Adjustment 
Area            C   Corr. C*     C   Corr. C*     C   Corr. C*     C   Corr. C* 

Home          .529    .59      .550    .62      .657    .74      .679    .76 
Health        .412    .46      .366    .41      .722    .81      .587    .66 
Social (3)    .536    .60      .423    .48      .580    .65      .664    .75 
Social (5)    .449    .50      .549    .62      .537    .60      .675    .76 
Emotional     .281    .32      .526    .59      .599    .67      .534    .60 

* Corrected for broad groupings by dividing C by .89. See Peters, Charles C., and 
Van Voorhis, Walter R. Statistical procedures and their mathematical bases. Pp. 393-399. 
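
The correction described in the footnote to Table 1 is plain arithmetic: each obtained C is divided by .89. The Python sketch below recomputes the corrected values for the Home row. It also shows the usual Pearson formula for a contingency coefficient, C = sqrt(chi-square / (chi-square + N)). The text does not say how C itself was computed, so that formula and the small table of counts are assumptions made only for illustration.

```python
import math

CORRECTION = 0.89  # divisor given in the footnote to Table 1


def contingency_coefficient(table):
    """Pearson's C = sqrt(chi2 / (chi2 + N)) for a table of counts.

    Assumption: the article does not state its formula; this is the
    standard definition. Every row and column must have a nonzero total.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return math.sqrt(chi2 / (chi2 + n))


def corrected(c):
    """Apply the correction for broad groupings used in Table 1."""
    return round(c / CORRECTION, 2)


# Obtained C values for the Home area, taken from Table 1.
home = {"Boys": 0.529, "Girls": 0.550, "Men": 0.657, "Women": 0.679}
home_corrected = {group: corrected(c) for group, c in home.items()}
print(home_corrected)

# Hypothetical 2 x 2 table of counts, for illustration only.
example_c = contingency_coefficient([[30, 10], [10, 30]])
print(round(example_c, 3))
```

Dividing the other rows of Table 1 by .89 in the same way reproduces the printed corrected coefficients to two decimals.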


Table 2 gives the per cent of each of the four groups whose ratings on 
the Bell Adjustment Inventory and the Adjustment Questionnaire are 
identical. These vary from 22% for the Health adjustment for boys, 
to 44.4% for the Home adjustment for women. The groupings regarding 
the degree of adjustment for the Bell Adjustment Inventory are fairly 
arbitrary, since one point in score may change an individual’s rating from 
one category to another. For practical purposes, we are primarily 
interested in knowing whether a person’s adjustment is average, above 
average, or below average. Table 3 presents the per cent of the four 
groups who were above average, average, and below average, for the 
various adjustment areas on both measures. As may be expected, the 
per cent of the groups falling within the same category on both measures 
is considerably higher than that obtained on a five-point scale. These 
vary from 37.2% for the Health adjustment for boys to 64.4% for the 
Social adjustment (5) for women. In all areas, the per cents are higher 
for girls and women than for boys and men but these differences are not 
statistically reliable. 
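
The two kinds of agreement discussed above (identical rating on the five-point scale, and same broad category) can be sketched in Python as follows. The ratings are hypothetical, and the rule for collapsing five points into three categories is an assumption, since the article does not give the exact mapping.

```python
# Five-point ratings coded 1 (excellent) to 5 (very unsatisfactory).
def collapse(rating):
    """Collapse a five-point rating to above average / average / below average.

    Assumption: ratings 1-2 count as above average, 3 as average and
    4-5 as below average; the article does not give the exact mapping.
    """
    if rating <= 2:
        return "above"
    if rating == 3:
        return "average"
    return "below"


def percent_agreement(bell, questionnaire, transform=lambda r: r):
    """Per cent of subjects placed in the same category by both measures."""
    if len(bell) != len(questionnaire):
        raise ValueError("both measures need one rating per subject")
    same = sum(transform(b) == transform(q) for b, q in zip(bell, questionnaire))
    return 100.0 * same / len(bell)


# Hypothetical ratings for eight subjects, for illustration only.
bell = [1, 2, 3, 3, 4, 5, 2, 3]
questionnaire = [2, 2, 3, 4, 5, 5, 1, 2]

five_point = percent_agreement(bell, questionnaire)
three_category = percent_agreement(bell, questionnaire, collapse)
print(five_point, three_category)
```

Because every pair that agrees on the five-point scale also agrees after collapsing, the three-category figure can never be lower than the five-point figure, which is the pattern the text reports.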


Table 2 

Per Cent of the Four Groups Whose Adjustment Rating on the Bell Adjustment 
Inventory and the Adjustment Questionnaire was Identical 

Adjustment    Boys    Girls    Men    Women 
Area            %       %       %       % 

Home          27.2    37.5    35.3    44.4 
Health        22.0    29.0    35.3    33.3 
Social (3)    39.8    40.5    27.5    24.4 
Social (5)    42.9    43.0    37.3    40.0 
Emotional     28.3    33.5    33.3    40.0 
Table 3 

Per Cent of the Four Groups Whose Adjustment Rating was Above Average, Average, 
and Below Average on the Bell Adjustment Inventory and 
the Adjustment Questionnaire 

Adjustment    Boys    Girls    Men    Women 
Area            %       %       %       % 

Home          45.0    57.0    56.9 
Health        37.2    40.0    49.0 
Social (3)    55.5    57.5    37.3 
Social (5)    58.1    58.5    51.0 
Emotional     36.6    45.0    49.0 








In comparing the adjustment on both measures, each of the four 
groups has rated itself more favorably on the Bell Adjustment Inventory. 
Of special interest therefore, are cases where the adjustment on the 
Adjustment Questionnaire is poorer than that given by the Bell Adjustment 
Inventory. In some cases there is a wide discrepancy, e.g., 
excellent on the Bell Adjustment Inventory, and very unsatisfactory on the 
Adjustment Questionnaire. The per cent of each group whose adjust- 
ment is poorer on the latter is presented in Table 4, and shows variations 
from 4.7% for the Health adjustment for boys to 28.9% for Social 
adjustment (5) for women. With the exception of the Health adjustment 
for women, girls and women tend to have a poorer adjustment on the 
Adjustment Questionnaire than do boys and men, but these differences 
are not significant. 


Table 4 

Per Cent of the Four Groups Whose Adjustment was Poorer on the Adjustment 
Questionnaire than on the Bell Adjustment Inventory 

Adjustment    Boys    Girls    Men    Women 
Area            %       %       %       % 

Home           8.4    11.0     7.8    11.1 
Health         4.7     6.5    11.8    11.1 
Social (3)    11.0    19.5     5.9    15.6 
Social (5)    16.2    22.5    19.6    28.9 
Emotional     11.5    13.5    19.6    22.2 





Summary 

The data indicate that it is possible to obtain a fairly good estimate of 
an individual’s adjustment by merely using a few simple direct questions. 
For practical purposes, the correlation between the two measures is high 
enough to warrant the substitution of the Adjustment Questionnaire for 
the Bell Adjustment Inventory if there is not sufficient time to give the 
latter. 

The incidence of a sizeable proportion of cases whose adjustment on 
the Adjustment Questionnaire is poorer than on the Bell Adjustment 
Inventory is especially noteworthy. This does not mean that the Bell 
Adjustment Inventory is invalid, but that an individual’s subjective 
judgment as to what he estimates his own adjustment to be is also im- 
portant psychologically. Certainly, it is desirable, if time permits, to 
include a questionnaire of this sort along with the Bell Adjustment 
Inventory to give the counselor additional insight into the situation. 


Received November 5, 1945. 








Item Difficulty of Some Wechsler-Bellevue Subtests 


A. I. Rabin, J. C. Davis, and M. H. Sanderson 
New Hampshire State Hospital, Concord, N. H. 


It is the experience of every psychometric examiner with most testing 
scales that some items which are supposed to be “easy” in view of their 
placement in the beginning of the scale are failed, while the more “difficult” 
ones, placed toward the end of the scale, are passed. The occasional 
occurrence of such inconsistencies does not disturb the examiner and is 
attributed to the idiosyncrasies of the testee and to the peculiarities of 
his mental development, whether in the normal or abnormal range. This 
is also to be expected in view of the fact that final placement of items in 
order of difficulty is usually based on numerical summaries and statistical 
data of large standardization groups which mask individual patterns. 
However, when such “inconsistencies” occur with some degree of regu- 
larity and consistency, the correctness of the order of items is to be 
questioned, at least, for the sample of the population involved. 

Extensive experience in the application of the Bellevue Intelligence 
Scales with normal and abnormal adults raises the question of the correct 
placement of items within the several subtests. We have observed, 
especially in such subtests as information, picture completion, compre- 
hension, similarities and others, that some of the items that appear in the 
first part of the respective subtests are much more difficult for our sub- 
jects and are failed with greater frequency than those appearing in the 
middle or last part of those tests. Quantitative proof and substantiation 
of these empirical notions appeared desirable. 

Wechsler’s last edition of the Measurement of adult intelligence¹ shows 
an implicit recognition of the need for analysis of item difficulty. In fact, 
Wechsler adopts modifications in the order of presentation of the In- 
formation questions, based on unpublished data communicated by 
Altus (p. 172). 

An intra-test analysis of item difficulty would accomplish two major 
services in the clinical situation. In the first place, a more correct and 
statistically justified arrangement of items would make the test a more 
efficient tool. Several consecutive failures on successive items would 
actually mean little chance for success beyond that level. Thus time 

¹ Wechsler, D. The measurement of adult intelligence (Third edition). Baltimore: 
The Williams and Wilkins Co., 1944. 
would be saved and further useless questioning obviated. Secondly, it 
might be less discouraging to some sensitive testees who have to experi- 
ence failures before reaching the actual limit of their capacities. 


The Subjects 


The present study is based on three hundred Wechsler-Bellevue records 
of individual examinations of normal persons, drawn from three 
sources: (1) 100 subjects were student nurses at the New Hampshire State 
Hospital School of Nursing; (2) 40 were members of a Conscientious 
Objectors unit stationed at the Hospital; and (3) 160 were vocational 
guidance cases. Some were referred to the Psychology Department for 
vocational advice, while others were private clients of the senior author. 
The mean IQ (full scale) for the entire group is 104.4. 

It is obvious, from Figure 1, that our group is better than average 
intellectually. Very few extremely retarded cases were included (5% 
feebleminded and about the same percentage of borderline mentality). 
The overwhelming majority consisted of individuals with better than dull 
normal mentality. Though the distribution covers a wide range, it is 
clearly skewed to the left. 


Procedure 


An analysis of the difficulty of the items in six subtests of the Wechsler- 
Bellevue Scales was undertaken. The items of the information test (25 
items) as mentioned earlier, on the basis of clinical observation and 
experience, are certainly not arranged in order of difficulty. Similarly, 
two other verbal subtests exhibited considerable scatter, i.e., comprehen- 
sion (10 items) and similarities (12 items). Of the remaining regular 
verbal subtests, the arithmetic test shows little irregularity, while the 
digits (forward and backwards) test does not lend itself to a difficulty 
analysis, since the gradation of difficulty depends on quantity (number 
of digits repeated) rather than on differences in difficulty of individual 
items. 

Three performance subtests lend themselves readily to analysis: 
picture completion (15 items), picture arrangement (6 items) and block 
designs (7 items). The object assembly test was not included, since only 
three items comprise the test. These seem to be arranged in order of 
difficulty. Because of the quantitative, uniform nature of the test, digit 
symbols were not included. 

Thus, the following analysis is applied to three Verbal and three 
Performance tests which can be submitted easily to item by item scrutiny. 
The procedure was to count the number of subjects (of the total group of 
300) passing each item on each test. The corresponding ranks of items 
were then computed on the basis of the ease with which they were passed 
by our subjects. Thus rank 1 is assigned to an item within a test which 
showed the highest number of successfully passing subjects; rank 2, next 
to the highest number, etc. 


The Data 


Wechsler’s original order of the items of the Information test is given 
in the first column of Table 1. The fourth column presents the revision 
of the order of the first 20 items based on Altus’s data and published in 
the third edition of Wechsler’s book.² Our rank order for the items, 
based on the data in the second column, may be found in column 3 of 
Table 1. The ranks in parentheses are those based on the actual data; 
they differ from the adjusted ranks beside them because of the special 
circumstances referred to in the footnote to the table. Otherwise, the 
rank order is as indicated. The first six items 
agree perfectly with the order suggested in Wechsler’s recent revision. 


² Op. cit. 
Table 1 
Data on Items of the Information Test 

Question No.    No. of      Rank        Revised Order 
(Wechsler)      Passes      Order       (Altus) 

     1           227       (9)*  1          1 
     2           271       (3)   4          4 
     3           254       (4)   5          5 
     4           272       (2)   3          3 
     5           291       (1)   2          2 
     6           237       (5)   6          6 
     7           211            13          7 
     8           228       (8)   9         16 
     9           229       (7)   8          9 
    10           226            10         10 
    11           143            18         12 
    12           231       (6)   7         11 
    13           214            12         14 
    14            78            20         17 
    15           168            16         20 
    16           181            14          8 
    17           218            11         13 
    18           118            19         15 
    19           175            15         19 
    20           150            17         18 
    21            45            23         21 
    22            49            22         22 
    23            66            21         23 
    24             8            25         24 
    25            19            24         25 





* Most of the subjects were tested during the late President Roosevelt’s 2nd and 3rd 
terms; hence, the difficulty in naming his predecessor which is required on this item. 
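
The ranking procedure (rank 1 for the item passed by the most subjects) can be sketched in Python using the pass counts from Table 1. The function name is hypothetical. The ranks it yields are those computed from the raw counts, which correspond to the parenthesized ranks for the early items rather than the adjusted ones.

```python
# Number of subjects (out of 300) passing each Information item, from Table 1.
passes = {
    1: 227, 2: 271, 3: 254, 4: 272, 5: 291, 6: 237, 7: 211, 8: 228,
    9: 229, 10: 226, 11: 143, 12: 231, 13: 214, 14: 78, 15: 168,
    16: 181, 17: 218, 18: 118, 19: 175, 20: 150, 21: 45, 22: 49,
    23: 66, 24: 8, 25: 19,
}


def rank_by_passes(pass_counts):
    """Return {item: rank}, with rank 1 for the most frequently passed item.

    Ties are not averaged here (none occur in this subtest); tied items
    would simply keep their original relative order.
    """
    ordered = sorted(pass_counts, key=lambda item: -pass_counts[item])
    return {item: rank for rank, item in enumerate(ordered, start=1)}


ranks = rank_by_passes(passes)
print(ranks[5], ranks[1], ranks[14])
```

Item 1, the question on the President's predecessor, falls to rank 9 on the raw counts, and item 14 falls to rank 20.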


The remainder show considerable variation. It is quite obvious that 
item 14 (discoverer of the North Pole) is badly misplaced since it ranks 
20th in difficulty. The same is true of items 11 and 17. Lesser dis- 
crepancies can be found in the remainder of the order. 

Roughly speaking, the test may be subdivided into three sections: 
questions 1-10, 12 and 17 which are easiest and are passed by more than 
two-thirds of our population. Then questions 14 and 21-25 are by far the 
most difficult and are passed by approximately }th of our population. 
The remaining items, passed by one-half of our subjects, do not provide a 
very gradual transition to the most difficult items. 

It is quite clear that unless the revised order of items is followed in 
the examination, the rule of discontinuing the questions after five suc- 
cessive failures may cause too many errors. 
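
Why item order matters under a discontinue rule can be shown with a small Python sketch. The subject, the thirteen-item test and both orderings below are hypothetical. Only the rule of stopping after five successive failures comes from the text.

```python
def score_with_discontinue(responses, max_failures=5):
    """Count passes, stopping after `max_failures` consecutive failures.

    `responses` lists pass (True) / fail (False) in order of presentation.
    """
    score = 0
    consecutive_failures = 0
    for passed in responses:
        if passed:
            score += 1
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures == max_failures:
                break
    return score


# Hypothetical subject: the item numbers this subject can answer.
known = {1, 2, 3, 4, 5, 6, 12, 13}

printed_order = list(range(1, 14))  # items 1..13 as printed
# Hypothetical reordering that places the two easier late items earlier.
easier_first = [1, 2, 3, 4, 5, 6, 12, 13, 7, 8, 9, 10, 11]

original_score = score_with_discontinue([i in known for i in printed_order])
reordered_score = score_with_discontinue([i in known for i in easier_first])
print(original_score, reordered_score)
```

In the printed order the run of five failures on items 7 to 11 ends the test before items 12 and 13 are reached, so the subject loses two passes that the reordered presentation recovers.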







A difficulty analysis of the other two verbal subtests is given in Table 
2. Here, too, some revision of order is desirable. Items 2 and 8 of the 
comprehension test, which are practically of the same difficulty, are 5 
ranks apart. Items 9 and 10 also need to change places. The sudden 
drop in the last two items is quite interesting. While nearly 85% of the 
subjects are able to pass the first 8 items, only 45% and 65% of cases are 
able to pass items 9 and 10, respectively. Here, too, a more gradual drop 
would be desirable. When using the scale, however, if any of items 6, 7, 
or 8 are failed, no successes on the last two items can be expected and 
therefore the time required for questioning may be saved. 


Table 2 
Item Analysis of Two Verbal Tests 

Comprehension 

Items             4    5    6    7    8    9   10 
No. of Passes   265  280  245  237  254  134  194 
Rank Order        4    3    7    8    5   10    9 

Similarities 

Items             1    2    3    4    5    6    7    8    9   10   11   12 
No. of Passes   282  289  289  284  291  173  151  216  181  160   63   72 
Rank Order        5  3.5  3.5    1    2    8   10    6    7    9   12   11 





Minor changes in the similarities test are also suggested by the analy- 
sis of item difficulty. There does not seem to be a great deal of variation 
in the difficulty of the first five items. The following five items with the 
exception of No. 8 again show a considerably higher level of difficulty. 
The last two items show a level of difficulty several times higher than the 
group of items preceding it. In summary, the test may be subdivided 
into three blocks with a rather sudden transition from one to another: 
Block One, consisting of items 1 to 5, with about 95% of the subjects 
passing; Block Two, consisting of items 6, 7, 9 and 10, with about 55% 
of the subjects passing; and Block Three, consisting of the remaining 
items (11 and 12), with only about 23% successes. The only seriously 
misplaced item is No. 8, which should be placed 6th in order. 
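
The three block percentages quoted above can be checked against the pass counts in Table 2. The Python sketch below assumes that the twelve Similarities counts in the table correspond to items 1 to 12 in order and that all 300 subjects took every item.

```python
N_SUBJECTS = 300

# Number passing each Similarities item, from Table 2
# (assumed to be listed in item order 1..12).
similarities = {
    1: 282, 2: 289, 3: 289, 4: 284, 5: 291, 6: 173,
    7: 151, 8: 216, 9: 181, 10: 160, 11: 63, 12: 72,
}

blocks = {
    "Block One": [1, 2, 3, 4, 5],
    "Block Two": [6, 7, 9, 10],
    "Block Three": [11, 12],
}


def mean_percent_passing(items):
    """Average per cent of subjects passing the given items."""
    total = sum(similarities[i] for i in items)
    return 100.0 * total / (len(items) * N_SUBJECTS)


block_percents = {name: round(mean_percent_passing(items), 1)
                  for name, items in blocks.items()}
print(block_percents)
```

Under those assumptions the averages come to 95.7, 55.4 and 22.5 per cent, in line with the rounded figures of about 95%, 55% and 23% given in the text.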

The results of an analysis of the three performance tests are given in 
Table 3. The items in the picture completion test show varying degrees 
of misplacement all along the line. If rearranged in the obtained rank 
order based on the number of subjects who passed each item, there would 
be a fairly gradual increase in difficulty. Of course, ordinarily all items 
are administered. However, time and effort may be saved if the test is 
Table 3 
Analysis of Three Performance Tests 

Picture Completion 

Items              1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
No. of Passes    291  278  271  233  173  277  185  242  275  198  105  237  126   76  123 
Rank Order         1    2    5    8   11    3   10    6    4    9   14    7   12   15   13 

Picture Arrangement 

Picture Sets       1    2    3    4    5    6 
No. of Passes    295  265  271  207  188  184 
Rank Order         1    3    2    4    5    6 

Block Designs 

Designs            1    2    3    4    5    6    7 
No. of Passes    293  284  287  245  262  167  146 
Suggested Order    1    3    2    5    4    6    7 








discontinued after 3 or 4 consecutive failures, providing the items are 
presented in order of difficulty. 

Little criticism may be levelled against the picture arrangement test. 
There are minor but insignificant variations. No changes in order can 
be considered essential. 

The third part of Table 3 presents the results for the Block Design 
tests. According to these data, an interchange of places between de- 
signs 4 and 5 would be quite desirable. Otherwise, the order of their 
arrangement appears to be quite reasonable and characterized by in- 
creasing difficulty. 

Substantially similar results and, consequently, substantially similar 
suggestions for intra-test rearrangements were obtained for 1,000 State 
Hospital patients. Our suggestions, however, are designedly not based 
on the findings with this much larger sample of a wider age range and 
wider range of levels of ability, since such an analysis of item difficulty 
may reflect the “scatter” and selective performance of the patients and 
may not hold strictly for a “normal” population. The present results are 
nevertheless corroborated by those findings and are given greater strength 
and conclusiveness by them. 
Summary 


Clinical experience during the administration of the Wechsler- 
Bellevue scales prompted a detailed quantitative analysis of the difficulty 
of items based on 300 records of normal individuals. The analysis was 
confined to the following six subtests which lent themselves to such 
statistical treatment: information, comprehension, similarities, picture 
completion, picture arrangement, and block designs. The suggested 
changes in the original order of item presentation are summarized in 
Table 4. The results are also largely substantiated by findings on 1,000 
psychiatric patients not included in this study. It is felt that the findings 
may help speed up the administration of the test and avoid a frequent 
excessive sense of failure on the part of less gifted examinees. 


Received October 8, 1945. 
The Relationship Between Knowledge of Human Development 
and Ability to Use Such Knowledge * 


John E. Horrocks 
Ohio State University 


A practitioner is one who applies knowledge and skill to practical 
situations. It is commonly assumed that ability to apply stems from a 
background of knowledge and experience. Hence, to apply is to know, 
to be steeped in principles and facts germane to the discipline being 
applied. That more than knowledge of facts and principles enters the 
picture is admitted. A pressing question in clinical practice and in 
teaching is the relationship that knowledge of facts and principles bears 
to ability to apply those facts and principles. This question has re- 
mained largely unanswered. 

Practically it would seem to follow that a person may not apply what 
he does not know. For that reason, and for the added reasons of 
administrative expediency, time, and expense, courses for the training of 
clinical psychologists and teachers have tended to concentrate on factual 
retention. Thus, those who in their professional capacities are called 
upon daily to deal with the intricacies of human behavior, bring to their 
work a background of factually oriented courses in psychology and 
education. 

Purpose of the Study 


It is the purpose of this study to examine the relationship existing 
between knowledge of facts and principles and ability to apply such 
facts and principles to one area of human development: adolescence. 
A psychologist or teacher has to deal with human behavior in complex 
life situations. He must diagnose and take remedial action in circum- 
stances complicated by inter-related and constantly evolving situations. 
He must apply what he knows about human behavior to the occasion 
which confronts him, bearing in mind all the while the antecedents of the 
occasion and the consequences of his decisions. This study will analyze 
the performance of a selected population, whose knowledge of the facts 
and principles of adolescent development has been tested, when con- 
fronted with a complex situation in which they are given an opportunity 
to make a diagnosis and select proper remedial procedure. 


* Grateful acknowledgment is made to Dr. Maurice E. Troyer of Syracuse University. 


Reliability and Validity of Instruments Used 


In carrying out the purpose of the study, four tests were constructed. 
The first was a criterion test which measured knowledge of fact and 
principle about adolescent development. Second, three case study tests 
were constructed which measured ability to apply the facts and principles 
of adolescent development. Each case study was divided into three 
parts. After each part the student was given an opportunity to reveal 
his ability to diagnose difficulties and select appropriate remedial pro- 
cedures. Answers to the case study tests were scored with a key based 
on a weighted composite of expert opinion. 

The criterion test was designed to cover the major principles and facts 
ordinarily included in texts and adolescent psychology courses. The 
subject matter coverage of the three case studies paralleled that of the 
criterion test. Each case centered around a different problem in adolescence: 
social, academic, and emotional. 

The reliability coefficient obtained by the split-half method for the 
criterion test was .91 ± .017. The validity of the criterion test was 
based on its reliability, internal consistency, coverage, construction, and 
keying. 

The split-half correlation obtained for the Case of Barry Black was 
.79 ± .038; for the Case of Sam Smith, .73 ± .046; and for the Case of 
Connie Casey, .77 ± .041. The validity of the three case studies rested 
upon construction and coverage, expert scoring, item consistency, 
reliability, and utility. 
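
The article does not say how the figures after each "±" were obtained. One reading, offered here only as an assumption, is the large-sample standard error of a correlation, (1 - r²) / √N, with N = 100 taken from the group sizes given under Procedure. The Python sketch below applies that formula to three of the reported coefficients.

```python
import math


def standard_error_of_r(r, n):
    """Large-sample standard error of a correlation: (1 - r^2) / sqrt(n).

    Assumption: the article does not say how its "±" figures were obtained.
    """
    return (1.0 - r * r) / math.sqrt(n)


# Split-half coefficients reported in the text; n = 100 is an assumption
# based on the group sizes given under Procedure.
reported = {"Criterion test": 0.91, "Barry Black": 0.79, "Connie Casey": 0.77}
errors = {name: round(standard_error_of_r(r, 100), 3)
          for name, r in reported.items()}
print(errors)
```

These three values agree with the printed .017, .038 and .041. The same formula gives about .047 for the Case of Sam Smith against the printed .046, so the reading should be treated as approximate.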


Procedure 


The criterion test and the three case studies were administered to 
populations of college juniors, seniors, and graduate students taking 
courses having to do with adolescent behavior, educational psychology, 
and mental hygiene. 

The Case of Barry Black was administered to a group of 100 college 
students, composed of 90 liberal arts and teachers college juniors and 
seniors, 7 graduate students, and 3 nurses in training. Thirteen of the 
100 were experienced teachers. All people selected were midway through 
a college course having to do with human behavior. 

During the class meeting prior to the administration of the Case of 
Barry Black, the class answered the criterion test. 

The same procedure was followed in the administration of the cases of 
Sam Smith and Connie Casey to groups of 100 each. For the Case of 
Sam Smith the group selected consisted of 42 teachers college juniors and 
seniors, and 58 liberal arts college juniors and seniors. For the Case of 
Connie Casey, 9 graduate students, 15 registered nurses (taking
undergraduate work in public health nursing), and 76 liberal arts and teachers
college juniors and seniors were selected. 

Product-moment coefficients of correlation were computed between 
the criterion test and the whole, diagnosis, and remedial scores for each 
of the three case studies. The correlations existing between the criterion
test and the three case studies are given in Table 1.


Table 1


Coefficients of Correlation Existing Between the Criterion Test and the Three
Sections of Black, Smith, and Casey


Variable            Whole          Diagnosis      Remedial

Black-criterion     .46 ± .078     .49 ± .075     .29 ± .091
Smith-criterion     .41 ± .083     .40 ± .084     .30 ± .091
Casey-criterion     .26 ± .093     .24 ± .094     .26 ± .093
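
The error terms printed beside each coefficient in Table 1 are not defined in the text. They are numerically close to the large-sample standard error of a correlation, (1 − r²)/√N, with N = 100. That reading is an assumption, since the authors do not name their formula; the sketch below simply checks it against the Smith and Casey rows.

```python
from math import sqrt

def se_of_r(r, n):
    """Large-sample standard error of a product-moment correlation."""
    return (1 - r * r) / sqrt(n)

# (r, printed error term) pairs read from the Smith and Casey rows of Table 1.
printed = [(0.41, 0.083), (0.40, 0.084), (0.30, 0.091),
           (0.26, 0.093), (0.24, 0.094), (0.26, 0.093)]

# Largest gap between the formula and the printed error terms, with N = 100.
largest_gap = max(abs(se_of_r(r, 100) - err) for r, err in printed)
```

Under this assumption every printed value agrees with the formula to within about .001.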





The three populations used above may be considered by virtue of 
selection and composition to be representative of a particular group of 
college students. They were all parts of larger populations,¹ and they
consist, for the most part, of college juniors and seniors with approxi-
mately the same general educational experience and age. In all cases the
tests were given midway through the course. For that reason it may be
assumed that differences existing among correlations between the criterion
test and the three case studies are due to chance or differences in the case
studies rather than to differences in the populations being considered.

An examination of Table 1 indicates that Barry Black and Sam Smith
correlate more highly with the criterion test than does Connie Casey. It
will also be noted that the correlations of Barry Black and Sam Smith with
the criterion test are approximately equal.

Where the diagnosis section is concerned, the same general trend may 
be noticed. Connie Casey correlates less well with the criterion test than 
does either of the others. Barry Black and Sam Smith correlate with the
criterion test to about the same extent. 

The remedial sections of all three case studies correlate with the 
criterion test to about the same extent. 


Second Administration of Connie Casey 


The case study tests used for the experiment described above were
administered in mimeographed form. As a further check the Case of
Connie Casey was revised and printed. Connie Casey was chosen for
printing because its coverage appeared to be wider than that of either
of the other two cases.

¹ The populations were selected from existing classes studying adolescent behavior
at two state and two private universities. If a given population number, as 100, was
required and the available class population was in excess of 100, the extra cases were
thrown out from the bottom of the pile of cases being scored.

The printed case was presented to two groups. The first group, to be 
known as Group A, consisted of 47 randomly selected graduate students. 
All were experienced teachers, and all had had previous courses in psy- 
chology. 

Group B consisted of 69 randomly selected teachers college seniors. 
None were experienced teachers, but all had completed their practice 
teaching and were finishing their professional training. All had com- 
pleted a course in adolescent development during the previous school year. 

Coefficients of correlation were computed between the criterion test 
and each of the three sections of the revised Case of Connie Casey for 
Groups A and B. Table 2 shows the correlations between the criterion 
test and the case study for Groups A and B. 








Table 2

Correlation Between the Printed Edition of Connie Casey and the Criterion
Test for Groups A and B

                 Group A           Group B
Section          (Graduates)       (Undergraduates)

Whole Test       .28 ± .136        .16 ± .12
Diagnosis        .35 ± .129        .27 ± .11
Remedial         .02 ± .147        −.04 ± .12





Here, again, was a positive but slight relationship between the criter- 
ion test and the case study. In all cases the standard error for the popu- 
lations taking the revised test was greater than for that taking the original 
test because of the greater number in the original group. The following 
relationships emerge in a comparison of Groups A and B with the original 
population: 


1. The correlation of the whole test with the criterion test was .02
higher for Group A than for the original population. 

2. The correlation of the whole test with the criterion test was .10 
lower for Group B than for the original population. 

3. The correlation of the diagnosis section with the criterion test was
.11 higher for Group A and .03 higher for Group B than for the original
population.

4. The correlation of the remedial section with the criterion test was 
.24 lower for Group A and .30 lower for Group B than for the original 
population. 
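
The four comparisons above follow directly from the Casey row of Table 1 and from Table 2. As a small arithmetic check, using the coefficients as read from those tables:

```python
# Casey-criterion correlations for the original (mimeographed) administration.
original = {"whole": 0.26, "diagnosis": 0.24, "remedial": 0.26}

# Correlations for the printed edition, Groups A and B (Table 2).
group_a = {"whole": 0.28, "diagnosis": 0.35, "remedial": 0.02}
group_b = {"whole": 0.16, "diagnosis": 0.27, "remedial": -0.04}

def changes(group):
    """Change in each correlation relative to the original population."""
    return {section: round(group[section] - original[section], 2)
            for section in original}

change_a = changes(group_a)
change_b = changes(group_b)
```

Positive entries are increases over the original population and negative entries are decreases.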


The whole and diagnosis increases could well be due to chance, but the
remedial correlation drop would appear to be a significant one whose 
answer might possibly exist in the changed nature of the population or of 
the test. A drop of .30 is higher than could be explained by chance, even 
at three sigmas. 

In other words, for both Groups A and B the ability to make a diag- 
nosis correlated more highly with the criterion test than did the diag- 
nostic ability of the original population. For both Groups A and B the 
ability to suggest a remedial plan correlated much less highly with the 
criterion test than did the remedial ability of the original population. 

Results on the remedial section, for Groups A and B, lead to the con- 
clusion that the printed form of Connie Casey has certain slightly different 
potentialities than has the original mimeographed form. 

From the point of view of the present study, however, the differences 
found on the revised edition of Connie Casey are not significant. In no 
case has any correlation been as high as .50, and the trend of the results 
from the revised Connie Casey indicates less of a positive relationship than 
would have been inferred from the original correlations obtained between 
the criterion test and the three original case studies. The conclusion is 
inescapable that the case studies (whole, diagnosis, and remedial), while 
measuring certain aspects in common with the criterion test, are at the 
same time measuring something which the criterion test fails to measure, 
and vice versa. 


Comparison with Intelligence and Class Marks 


Though not directly pertinent to the purposes of this study, it was 
believed advisable to compute the correlation existing between the case 
study tests and intelligence, and between the cases and final marks in a 
course in adolescent psychology. The Ohio State University Psychologi- 
cal Test, Form 22, was administered to a randomly selected group of 61 
teachers college juniors, and the group was given the revised edition of 
Connie Casey. The relationships existing between the various sections
of the Case of Connie Casey and intelligence are given in Table 3. 


Table 3


Relationship Between Scores on Connie Casey and the Intelligence of a Selected
Group of Subjects


OSPE vs. Whole Test ................ .23 ± .122
OSPE vs. Diagnostic Section ........ .10 ± .127
OSPE vs. Remedial Section .......... .25 ± .121











Results shown in Table 3 would appear to indicate a very small re- 
lationship between intelligence and the case study tests with this particu- 
lar group. 

It may be concluded from the foregoing correlations that the test is 
measuring factors other than intelligence, and that a more intelligent 
person might or might not do as well as a less intelligent one. Intelligence
here is, of course, intelligence as measured by the OSPE.

At this point it might be asked if it would be possible for a dull normal 
person to receive as high a rating on the test as a superior person, since 
intelligence apparently plays a minor part. The answer is “no.” It 
must be remembered that the population involved was made up of suc- 
cessful college students, and therefore of persons of superior or near 
superior intelligence. The finding is, then, that given a basically good 
intelligence, added increments of intelligence in the superior range do not 
add to ability to succeed on the test in question. From the foregoing
correlations it might also be tentatively assumed that intelligence may be 
minimized as a factor in comparisons between the case study and criterion 
tests. 

A coefficient of correlation was secured between the final marks of 
59 teachers college juniors in a course in adolescent development and 
their scores on the Case of Connie Casey. The case was not used in 
determining final marks. The coefficient of correlation was .38 ± .112,
which shows a slight positive relationship. The coefficient of correlation
found between 61 juniors’ marks in the same course and their scores on
the criterion test was .50 ± .095, an increase of .12 over the case study’s
correlation with class marks. The difference is not significantly large, 
but for the people used in making the comparison there was more agree- 
ment between the criterion test and the final mark than between the 
case study and the final mark. In considering this relationship it must 
be remembered that the course grades were based on tests somewhat 
similar to the criterion test. This might help to explain the difference of 
.12 between the criterion test and the case study. 


Inter-Relationships between the Case Studies 


The question next arises as to the inter-relationships between the case 
studies. If it is to be assumed that each one is measuring ability to apply 
fact per se, then it would be expected that a high positive correlation 
would exist. On the other hand, in constructing the case studies, each 
case was made to deal with a different aspect of adolescent development, 
and the question might arise as to whether ability to apply knowledge 
about an emotional problem would indicate ability to apply knowledge 
about a social or physical problem. There is also the question as to
whether ability to apply knowledge about a school situation would indicate
equal ability to apply knowledge about a home or community
situation. If certain basic factors are involved, a positive, though not 
necessarily high correlation might be expected. 

As a matter of fact, with 67 cases the coefficient of correlation between
Barry Black and Sam Smith was .55 ± .09. With 68 cases the
correlation between Barry Black and Connie Casey was .39 ± .10. The
correlation between Sam Smith and Connie Casey was .62 ± .09.


Summary and Conclusions 


It has been the purpose of this study to find the relationship existing 
between knowledge of fact and principle about human development on 
the one hand, and ability to apply those facts and principles on the other. 
Four tests were constructed. First was a criterion test which measured 
knowledge of fact and principle about adolescent development. Second, 
three case study tests were constructed which measured ability to apply 
the facts and principles of adolescent development. 

The criterion test and the three case studies were administered to 300 
college upperclassmen and graduate students taking courses having to do 
with adolescent behavior. As a check a printed revision of one of the 
case studies together with the criterion test was administered to a group 
of 47 graduate students and to a group of 69 undergraduates. Inter- 
relations existing between the criterion test and the case studies were 
analyzed. Subsidiary analyses were made of the relationship existing 
between one of the case studies and intelligence, and final grades in a 
course in adolescent development. 

As a result of the study the following conclusions are tentatively made: 


1. It would appear that knowledge of facts and principles about 
adolescent behavior is positively but not highly related to ability to
diagnose as measured by the case study tests. 

2. It would appear that knowledge of facts and principles about 
adolescent behavior is positively but not highly related to ability to
identify appropriate remedial procedures as measured by the case study 
tests. 

3. Given intelligence enough to pursue college work, added incre- 
ments of intelligence appear to show a very slight positive relationship to 
ability to apply facts and principles of adolescent development. The 
same relationship holds true for ability to recognize the facts and principles
of adolescent development after having studied them in a college
classroom. The group used in studying these relationships were, however,
a select group in that they had gone through a rigorous selection
program before being admitted to teachers college. The median percentile
rank for the group (on the basis of the Ohio State University Psychological
Test Norms) was 79. No one was under the 50th centile. Hence,
the above conclusion should be qualified by noting that the low relation- 
ship was found within a comparatively narrow segment of the range of 
intelligence. 

4. The relationship between success in a course in adolescent develop- 
ment as indicated by a final grade and ability to apply facts and princi- 
ples of adolescent development was positive but small as measured by 
the instruments used in this study. The ability to recognize facts and
principles was not highly related to success in the same course, although
its relationship was greater than that of the ability to apply facts and
principles.

5. There does not appear to be an ability per se to apply facts and 
principles of adolescent behavior, but rather ability to apply various 
facts and principles to varying situations. It may be assumed, however, 
that there are common factors underlying the various specific abilities. 
This conclusion is drawn, in part, from the varying performances re- 
vealed on the three case studies, each dealing with different aspects of 
behavior. 

6. Insofar as traditional courses in human growth and development 
have used factual learning as the only method of preparing their students 
to diagnose and institute remediation in a life situation, they appear to 
have served their function less adequately than might otherwise be 
possible. The findings of this study may well challenge the assumption 
in any course in psychology or teacher education that knowledge of fact 
and principle necessarily leads to effective or intelligent application of 
fact and principle. 

7. It would appear from the foregoing conclusions that measurement 
devices traditionally used in teacher education and psychology do not 
satisfactorily measure ability to apply facts, however well they may 
measure knowledge of facts themselves. This conclusion is one that 
might well have been expected when one considers the results of the large 
number of studies of the relationship between learning and transfer of training.


Received November 16, 1945.


Keysort Method of Scoring the Minnesota Multiphasic
Personality Inventory 


Capt. Morse P. Manson, AGD, and Capt. Harry M. Grayson, AGD 
Psychologists, MTOUSA Disciplinary Training Center 


The Minnesota Multiphasic Personality Inventory (MMPI), “a
psychometric instrument designed ultimately to provide, in a single test,
scores on all the more important phases of personality,”¹ has been
receiving widespread clinical recognition and application. It contains 550
different statements about the behavior, attitudes, and interests of the
person being tested, each one appearing on a separate card. The
administration is very simple, the examinee being required to sort the cards
into three groups according to how each statement applies to him. However,
one practical difficulty
encountered in the use of the MMPI is the lengthy method of scoring.
“. . . an enlisted man . . . trained in the work . . . usually required
twenty to twenty-five minutes for the recording and scoring of a test.”²
An improved method of scoring which minimizes clerical error, eliminates
entirely the recording of individual items, and reduces the scoring time to
less than ten minutes has been developed and used in the Personnel
Evaluation Department of the MTOUSA³ Disciplinary Training Center.


Rationale 


Examination of the MMPI scoring keys reveals that test items may 
appear on one or more keys. High scores are indicative of mental dis- 
turbance. Most of the items (299) are scored only if answered in a 
“deviate” or infrequent manner. These may be termed pure deviate 
items and appear as X-items on the original scoring keys. Other items 
(55) are scored only if answered in a “non-deviate” or frequent manner.
These may be termed reverse deviate items and appear as O-items on the 
original scoring keys. A final group of 12 items, which are called XO
items for identification, are scored on some subtests if answered in a
deviate manner and on other subtests if answered in a non-deviate
manner. They may be termed mixed items and appear either as X- or as
O-items on the different original scoring keys. These three types of items
require different treatments in scoring.⁴


¹ Hathaway, S. R., and McKinley, J. C. Manual for the Minnesota multiphasic
personality inventory. N. Y.: The Psychological Corporation, 1943.

² Leverenz, Major C. W. Minnesota Multiphasic Inventory, an evaluation of its
usefulness in the psychiatric services of a station hospital. War Medicine, 1943, 4,
618-629.

³ Mediterranean Theater of Operations, United States Army.

In the original MMPI scoring method, the CANNOT SAY items are 
entered in the appropriate cells on the score sheet with a heavy diagonal 
line and are then discarded. The TRUE and the FALSE groups are 
conveniently separated into deviate and non-deviate packets and only the 
deviate answers are used in recording scores. Among the cards filed as
TRUE, the deviate cards are those with the lower righthand corner cut;
among the cards filed as FALSE, those with the lower lefthand corner cut.

The deviate cards are recorded on the score sheet by an X for each 
item placed in the appropriate cell. The various subtest scoring keys are 
then applied, and each item on the score sheet appearing next to an X on 
the key is counted. Each blank item appearing next to an O on the key is
likewise counted. The CANNOT SAY items are not credited on any of 
the keys, but their diagonal line entries are added separately to yield a 
Question (?) score. 

The improvements in scoring the MMPI involve an adaptation of the 
McBee Keysort System, used by the Army in connection with WD AGO 
Form No. 20, the Soldier’s Qualification Card. This system permits the 
rapid selection of cards for any desired trait or combination of traits. 
The four sides of each card are lined with holes, each of which represents 
a number, trait, or specified item, identified and coded. To indicate the 
possession of designated characteristics, the proper holes are cut out with 
a U-shaped notch between the hole and the edge of the card. The 
insertion of a rod or needle through the proper hole in a stack of cards will 
release all notched or desired cards. 

As applied to the MMPI, the cards are notched in a similar fashion, 
enabling their rapid removal (and scoring by simple count) for each of the 
subtests. (See Figure 1.) 

⁴ A large group of items (184) play no part in the scoring on any of the subtests, and
are counted only in the CANNOT SAY score. These may be of ultimate significance
on new or revised keys. At present, however, they unduly increase test administration
and scoring time and may be set aside without invalidating the test to any extent. The
twelve XO items, most of which appear on the Hypomania Scale, could also safely be
eliminated without harming the test, at the same time greatly simplifying the scoring
process. Perhaps a better procedure would be to have the XO cards scored only as X
or as O cards depending upon the major role of the particular items. This simplified
scoring would more than compensate for the imperceptible change in norms over the
various subtests. However, the scoring procedure described in this paper is based on
the use of all 550 cards.

The coded holes for the subtests have been arranged counter-clockwise
on the reverse side of the cards, following the order of the MMPI
score sheet, except that there is no hole for the CANNOT SAY (?) items
since the question score is obtained merely by counting the cards. The
order of tests on the MMPI score sheet is as follows (see Figure 2):


1. ?—Question score
2. L—Lie score
3. F—Validity score
4. Hs—Hypochondriasis
5. D—Depression
6. Hy—Hysteria
7. Pd—Psychopathic deviate
8. Mf—Interest (masculine-feminine)
9. Pa—Paranoia
10. Pt—Psychasthenia
11. Sc—Schizophrenia
12. Ma—Hypomania

[Figure 1: card with an alignment cutout and holes along its edges, bearing the
statement “I have never been in trouble with the law.” and the label
“Psychopathic Deviate.”]

Fig. 1. Typical card, front view.


Method 


All MMPI Cards are cut as listed in a Master Coding Chart.⁵ An
off-centered triangular cut at the top of each card aids in their proper 
alignment and prevents possible reversal from the upright, front-face 
position. The conversion from the original system to the punched-card 
system of scoring is made as follows: 

a. The X-items or pure deviate items are notched for their respective
subtest holes in accordance with the original scoring keys.

b. The O-items or reverse deviate items, since they are scored only if
answered in a frequent or non-deviate fashion, have the TRUE and the
FALSE holes notched in the opposite way from the X-items. (These
TRUE and FALSE holes are in place of the cut-off corners on the cards.)
For example, O-items receiving credit if answered as TRUE, for scoring
purposes, are notched as if answered as FALSE. Conversely O-items
receiving credit if answered as FALSE, for scoring purposes, are notched
as if answered as TRUE. (See Master Coding Chart and Table 1.)

c. The XO items, which on one test are scored as X-items or pure
deviate items and on another test are scored as O-items or reverse deviate
items, are separated by needle into two groups, an XO DEVIATE and an
XO NON-DEVIATE group. The XO hole appears under the alignment
cutout as shown in Figures 2 and 3. The items in the XO DEVIATE
group are credited for X’s, since in this group they are scored only as
pure deviate (X) items. The items in the XO NON-DEVIATE group
are credited for the O’s, since in this group they are scored only as reverse
deviate (O) items.

⁵ Copies of the Master Coding Chart may be obtained from Morse P. Manson,
8212 Blackburn St., Los Angeles 36, California.

Fig. 2. Guide card (reverse view). (Note: The guide card can be used to aid in
accurate needling of the cards. This card is always placed on the reverse side of the
stack of cards to be needled.)

Fig. 3. XO card (front view). (Note: The letters “d” and “n” appear only on
reverse side of every XO card.)

For illustrative purposes, the manner of notching the TRUE and 
FALSE holes for the three different types of items, deviate (X),
non-deviate (O), and mixed (XO) is described in the Master Coding Chart and
Table 1. 

Table 1


Illustration of the Notching of TRUE and FALSE Holes for the Three
Different Types of Items


Item           Subtest Items     Cut Corner     Hole Punched

A-1            Hs, D, Hy         left           FALSE
[illegible]    Hs, Hy            right          TRUE
A-3            D, Hy             right          FALSE
B-6            D                 left           TRUE
B-3            D, Ma             right          TRUE
B-8            Hs, D             left           FALSE





It can be seen that the pure deviate or X-items and the mixed or XO
items have the TRUE hole notched where the cards are cut in the lower 
righthand corner, and have the FALSE hole notched where the cards are 
cut in the lower lefthand corner. The reverse deviate or O-items, on the 
other hand, have the TRUE and FALSE holes notched in exactly op- 
posite fashion. 

As for the subtests, the pure deviate (X) and the reverse deviate (O) 
cards have the holes notched for each subtest on which they appear. The 
mixed (XO) items have a double, or paired, set of holes around the edge 
of the card for each of the subtests, identified by the letters d and n. 
Where an item is scored as a deviate or X-item on a given subtest, a notch
is cut through the d hole on the card. Where it is scored as an O or
non-deviate item on a given subtest, a notch is cut through the n hole on
the card. (See Figure 3.)


Procedures 


The test is administered exactly as described in the MMPI Manual 
resulting in the same three stacks of cards: TRUE, FALSE, CANNOT 
SAY. 


1. The CANNOT SAY cards are counted and entered in the ? space 
on the score sheet and then discarded. 

2. The TRUE and FALSE stacks of cards are turned face down so 
that the needle enters through the blank side. 

3. The needle is inserted through the TRUE hole on the TRUE 
stack of cards. The deviate cards drop out and are placed in a deviate 
group. The non-deviate cards remain on the needle and are placed in a 
non-deviate group. 

4. The needle is inserted through the FALSE hole on the FALSE 
stack of cards. The deviate cards drop out and are added to the cards in 
the deviate group. The remaining cards are added to the cards in the 
non-deviate group. There are now two stacks of cards, deviate and non- 
deviate. 

5. The needle is inserted through the XO hole in the deviate stack, re- 
leasing the XO DEVIATE cards, which are placed in a separate group. 
The cards remaining on the needle constitute the DEVIATE stack.

6. The needle is inserted through the XO hole in the non-deviate 
stack, releasing the XO NON-DEVIATE cards, which are placed in a 
separate group. The non-deviate cards which remain on the needle are 
no longer used in scoring and are added to the CANNOT SAY cards 
which already have been discarded. 

7. There are now three kinds of cards to score: (a) DEVIATES, (b) 
XO DEVIATES, (c) XO NON-DEVIATES. The cards are ready for 
the final scoring of the individual subtests. 

8. The needle is inserted through the subtest holes of the DEVIATE 
group in counter-clockwise manner, beginning with the L hole. The 
cards which drop out are counted and then returned to the DEVIATE 
stack. Cards needled out of any stack, after being counted, are always 
returned to that stack prior to scoring the next subtest. 

9. The twelve XO cards are scored in different fashion. For each of
the subtests there is a double or paired set of holes around the edge of
these cards. On the back or blank side of the card the letters d and n
appear under the holes. Those cards which are scored as X or deviate
items on a given subtest have the d hole notched. Those cards which
are scored as O or NON-DEVIATE items on a given subtest have the n
hole notched. For each subtest in the XO DEVIATE group, the needle
is inserted in the d hole and those cards which fall off the needle are
counted. For each subtest in the XO NON-DEVIATE group, the
needle is inserted in the n hole and similarly those cards which fall off the
needle are counted.

10. The same procedure is gone through for each of the subtests.
The score is always the sum of the number of cards dropped from the
DEVIATE, the XO DEVIATE, and the XO NON-DEVIATE stacks.


[Figure 4: diagram of the breakdown of the TRUE and FALSE stacks into Deviate
and Non-Deviate groups, and then into XO Deviate, Deviate, XO Non-Deviate, and
Non-Deviate (Discard) stacks.]

Fig. 4. Steps in breakdown.


In summary, the first needle movement, through the stack of TRUE
cards, removes all cards which are placed in a DEVIATE group, the 
remainder being placed in a NON-DEVIATE group. The second needle 
movement, through the stack of FALSE cards, removes all cards which 
are added to the DEVIATE group, the remainder being added to the 
NON-DEVIATE group. The third needle movement, through the 
DEVIATE group, extracts the XO DEVIATE cards. The fourth 
needle movement, through the NON-DEVIATE group, extracts the XO 
NON-DEVIATE cards. The remaining stack is discarded. The sub- 
test scores are obtained from the three stacks: DEVIATE, XO DEVIATE, 
XO NON-DEVIATE. (See Figure 4.) 
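
The needle movements summarized above can be sketched in a few lines of Python. This is an illustration only, not the authors' material: the four cards, their notches, and the hole names (including labels such as "Ma:d" and "D:n" for the paired d and n holes) are hypothetical, and the real notching is defined by the Master Coding Chart, which is not reproduced here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Card:
    item: str
    notches: frozenset  # names of the holes notched open to the card's edge

def needle(stack, hole):
    """Needle one hole: notched cards drop out, the rest stay on the needle."""
    dropped = [c for c in stack if hole in c.notches]
    retained = [c for c in stack if hole not in c.notches]
    return dropped, retained

def score(true_stack, false_stack, cannot_say, subtests):
    scores = {"?": len(cannot_say)}                 # ? score is a simple count
    dev_t, non_t = needle(true_stack, "TRUE")       # first needle movement
    dev_f, non_f = needle(false_stack, "FALSE")     # second needle movement
    xo_dev, deviate = needle(dev_t + dev_f, "XO")   # third needle movement
    xo_non, _discard = needle(non_t + non_f, "XO")  # fourth needle movement
    for s in subtests:
        scores[s] = (len(needle(deviate, s)[0])
                     + len(needle(xo_dev, s + ":d")[0])
                     + len(needle(xo_non, s + ":n")[0]))
    return scores

# Hypothetical four-card deck.
x1 = Card("x1", frozenset({"TRUE", "Hs", "D"}))             # X-item, deviate if TRUE
o1 = Card("o1", frozenset({"FALSE", "D"}))                  # O-item, credited if FALSE
xo1 = Card("xo1", frozenset({"TRUE", "XO", "Ma:d", "D:n"})) # mixed item
u1 = Card("u1", frozenset())                                # item on no scoring key

subtests = ["Hs", "D", "Ma"]
first = score([x1, xo1], [o1, u1], [], subtests)    # xo1 answered TRUE
second = score([x1], [o1, u1, xo1], [], subtests)   # xo1 answered FALSE
```

In the sketch the mixed card adds to Ma when answered in the deviate direction and to D when answered in the non-deviate direction, which is the behavior the d and n holes are meant to produce.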

The MMPI Manual lists twenty-six categories into which the cards 
have been classified and gives the number of items that appear in each 
category. It would be a simple matter to punch the cards, and code the 
holes, for the various categories, so that by the use of a needle the clinician 
could identify deviate and non-deviate answers in any area of special 
interest with regard to the particular patient. 

Although the punch card method may appear to be complicated from 
the foregoing description, it is actually very simple in practice. One or 
two scoring attempts will demonstrate the simplicity, speed, and accuracy 
of this scoring method. The punch card technique described here can 
be adapted to the scoring of any tests which make use of, or can be con- 
verted to, a card system of administration. 

The punch card system need not be limited to tests of personality. 
It can be applied equally well to any multiple-choice test or question- 
naire. Test batteries of various types can be readily compiled and 
rapidly separated and scored by needling. Specific subtest areas, com- 
mon to several tests, can be combined and analyzed. For example, all 
vocabulary items on several tests, or all spatial relations items, or all 
problem-solving items can be isolated, unified, and totally treated. New 
composite inter-test norms of increased reliability can be derived. 

This method makes possible the rapid scoring of tests, enabling the 
clinician to devote more time to diagnosis and therapy. It enables him 
to use more tests and to make a quicker selection of specific items in 
areas of personality that are of special interest.

The elimination of paper and pencil, as provided by the MMPI, can 
be carried to the administration of many other tests. It now is practical
for the examiner to reduce the administration of tests to a single instruc- 
tion at the beginning, as in the MMPI, and to a rapid, simplified scoring 
at the end, as in this method of scoring. 


Received November 9, 1945.





Profile Analysis of the Minnesota Multiphasic Personality 
Inventory in Differential Diagnosis * 


Paul E. Meehl 
University of Minnesota 


A personality test may be employed in several kinds of clinical situ- 
ations. These include: overall differentiation of normals from abnormals 
or persons predisposed to abnormal developments as in “screening” in
the military, industrial, educational, or general medical out-patient 
situation; differential diagnosis among abnormals; prognosis; evaluation 
of changes and results of therapy; and the assessment of certain compon- 
ents for other than strictly diagnostic purposes such as the detection of 
important paranoid trends in a reactive depression even though diagnosis 
presents no problem. 

The present paper presents preliminary data on the use of the Minne- 
sota Multiphasic Personality Inventory (MMPI) with respect to differ- 
ential diagnosis, with secondary findings upon the subject of overall 
identification of “abnormals” from people in general. Since MMPI has 
been described elsewhere (4), we may merely state that this device is a 
structured personality test which yields scores on nine components of 
abnormality, namely Hs (Hypochondriasis), D (Depression), Hy (Hys- 
teria), Pd (Psychopathic deviate), Mf (Masculinity-femininity), Pa 
(Paranoia), Pt (Psychasthenia), Sc (Schizophrenia), and Ma (Hypo- 
mania). In addition, there are four scores which indicate “validity,” in 
the sense that they attempt to detect test-records which for reasons such 
as confusion, language difficulty, or non-cooperation cannot be accepted 
as adequate samples of the patient’s verbal behavior. These scores are ? 
(Cannot say), L (“Lie”), F (carelessness and misunderstanding), and a 
recently devised suppressor called K (5). For the details of function and 
interpretation of these validity indicators, the reader is referred to previ- 
ous articles. In the present study, the scale Mf (Masculinity-femininity), 
has been excluded from consideration throughout, so that there are only 
eight personality components involved. All of the scores are expressed as 
T-scores, the general normal sample having a mean of 50 and an S.D. of 10. 
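
The T-score convention just described can be illustrated with a minimal sketch, assuming the usual linear transformation of raw scores. The function name and the raw-score mean and S.D. below are hypothetical values chosen only for illustration; they are not the published MMPI norms.

```python
def t_score(raw, norm_mean, norm_sd):
    """Linear T-score: mean 50 and S.D. 10 in the normative sample."""
    return 50.0 + 10.0 * (raw - norm_mean) / norm_sd


# Hypothetical normative values for one scale (not actual MMPI norms).
NORM_MEAN = 20.0
NORM_SD = 5.0

print(t_score(20, NORM_MEAN, NORM_SD))  # a raw score at the normal mean
print(t_score(30, NORM_MEAN, NORM_SD))  # a raw score two S.D. above the mean
```

On this scale a T-score of 70 lies two standard deviations above the mean of the general normal sample.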

* This article is a “prior publication,” the author paying complete costs. The 
scheduled 80 pages per issue is thereby increased by the corresponding amount, thus the 
“early publication” of this article is a direct contribution to the subscribers of the 
Journal of Applied Psychology without handicap to those authors whose articles are 
accepted and printed in their regular turn. 

The purpose of the present study was to evaluate MMPI as used in 
the differential diagnosis of three main categories of hospitalized psychi- 
atric patients: psychosis, psychoneurosis, and “conduct disorder.” 
Gough (2) and Schmidt (6) have stressed the importance of considering 
the “pattern” or configuration of the profile in addition to the elevation 
of single scores. An elevation on a single component, even if it is the 
highest or “peak” score of the profile, does not imply that the patient 
should be so diagnosed. For example, the most frequent peak score on 
abnormal profiles of all sorts is D (Depression). It is clinically known 
that many different kinds of psychiatric difficulties involve degrees of 
depression, and the test reflects this fact. Again, a peak of 75 on Sc 
might suggest a schizophrenic picture, whereas if it occurs together with 
markedly elevated scores on the neurotic triad (Hs, D, Hy) and a Pt of, 
say, 85, it may better be taken to indicate a psychoneurosis with poor 
prognosis (3). It must be emphasized that the patterning of a profile 
cannot be neglected in the case of structured tests any more than we 
would think of interpreting one determinant column of the Rorschach 
without considering anything else. 

As yet, these configurational criteria on MMPI have not been ade- 
quately treated in the literature. Locally, the Minnesota group have 
tended to form more or less crude clinical judgments and global impres- 
sions based upon accumulated experience. The articles by Schmidt and 
Gough have contributed materially to the objectification of procedure, 
although neither of these investigators published results in the form of 
percent correct identifications for clinically diagnosed groups, a kind of 
treatment which is in many ways more meaningful than establishment of 
significant differences between central tendencies (1, p. 19). Furthermore, 
in both of these articles the similarity of “psychosis” to “severe 
psychoneurosis” in MMPI profile is too close for comfort, a drawback of 
MMPI which has been informally reported by a number of military 
clinical psychologists through personal communications. 

In the present investigation, an attempt has been made to determine 
the approximate accuracy of a very rapid, inspectional diagnosis from the 
MMPI profile alone, using the more or less poorly defined criteria which 
have so far seemed valuable clinically. Naturally, it is not suggested 
that the profile be used in this way, but we want to know how much the 
test can contribute entirely on its own when so used. Because of the fact 
that recently hospitalized cases were not diagnosed independently of 
MMPI, it was necessary to utilize old cases, before July 1941, on whose 
response sheets the present scales had been subsequently scored. At 
that time, the MMPI had not been published and was still in process of 
development. The only scales which appeared on the profiles then in 
use were H (a relatively less valid, uncorrected key for hypochondriasis) 
and D (Depression). For all practical purposes, it may be assumed that 
the clinical diagnoses made on these cases at that time were almost wholly 
unaffected by the presence of these scores on the chart. Of course, none 
of the present “pattern” criteria could have been employed at that time; 
further, knowledge of and confidence in the test were negligible among the 
psychiatric staff. 

The procedure of blind diagnosis was as follows: profiles of male 
abnormals were leafed through in the order of their appearance in the 
files (roughly chronological). Any profile showing a ? (Cannot say) or 
L (“Lie”) score as great as 70 was recorded as “invalid,” except that if 
any abnormal score reached a standard score of 80, an elevated L score 
was ignored, since defensive lying could hardly be the reason for such a 
positive elevation. F was allowed to reach a raw score of 16 (T = 80) 
before the profile was considered invalid. The terms “valid” and “invalid” 
are used hereafter to indicate the acceptability of the profile as an 
adequate measure in terms of ?, L, and F, and have no reference to the 
question of accuracy of identification. When it had been decided that a 
profile was “valid” by these criteria, it was classified as either normal or 
abnormal. Actually, of course, it was known that all of the cases were 
abnormal, so that the criteria of classification had to be made wholly 
objective and hence more rigid than would be the case in practice. Pro- 
files were called abnormal under the following four conditions: 


1. Any of the eight components showed T > 90. 

2. Any of the eight components showed T > 80, unless K < 40.¹ 

3. Any of the eight components showed T > 70, unless K < 50 and 
L < 60. 

4. Any of the eight components showed T > 65, unless K < 65 and 
L < 60. 


It can be seen from the above criteria that the classification into 
normal or abnormal is a matter of spotting the highest T-score, then 
reading to the right to see if the restrictions on K and L throw the profile 
into one group or the other. The profiles consist wholly of MMPI scores 
and a code number, so that there is no other source of information in 
making the diagnosis. 
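
The validity screen and the four abnormality conditions can be written out as a short sketch. It is only one possible reading of the text, and several choices are assumptions rather than statements of the author: the conditions are applied in bands (only the band containing the highest T-score is consulted, following the remark about spotting the highest T-score and then reading across to K and L); each "unless" clause is read as a conjunction; a raw F of exactly 16 is treated as still acceptable; and all class, field, and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Profile:
    """One MMPI record; all field names are illustrative."""
    components: Dict[str, float]  # T-scores on the eight components (Mf excluded)
    cannot_say_t: float           # T-score on ? (Cannot say)
    l_t: float                    # T-score on L
    f_raw: int                    # raw score on F
    k_t: float                    # T-score on K


def is_invalid(p: Profile) -> bool:
    peak = max(p.components.values())
    if p.cannot_say_t >= 70:
        return True
    # An elevated L is ignored once any component reaches T = 80.
    if p.l_t >= 70 and peak < 80:
        return True
    # Assumption: a raw F of exactly 16 (T = 80) is still acceptable.
    return p.f_raw > 16


def is_abnormal(p: Profile) -> bool:
    """Banded reading: only the band holding the peak score is consulted."""
    peak = max(p.components.values())
    if peak > 90:
        return True
    if peak > 80:
        return not (p.k_t < 40)
    if peak > 70:
        return not (p.k_t < 50 and p.l_t < 60)
    if peak > 65:
        return not (p.k_t < 65 and p.l_t < 60)
    return False


def classify(p: Profile) -> str:
    if is_invalid(p):
        return "invalid"
    return "abnormal" if is_abnormal(p) else "normal"


def make_profile(peak, k_t, l_t=50, cannot_say_t=50, f_raw=5):
    """Build a hypothetical profile whose highest component is D."""
    scores = dict.fromkeys(["Hs", "Hy", "Pd", "Pa", "Pt", "Sc", "Ma"], 55.0)
    scores["D"] = float(peak)
    return Profile(scores, cannot_say_t, l_t, f_raw, k_t)


print(classify(make_profile(peak=85, k_t=45)))          # peak above 80, K not below 40
print(classify(make_profile(peak=72, k_t=45)))          # peak above 70, K < 50 and L < 60
print(classify(make_profile(peak=75, k_t=55, l_t=72)))  # L of 70 or more, no score at 80
```

Under a non-banded reading, in which a profile is abnormal if any one of the four conditions holds, some profiles with a low K and a high L would be classified differently; the text does not settle which reading was used.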

¹ The scale K is a correction scale or suppressor variable which may be used to 
correct for certain test-taking attitudes which tend to invalidate a record. If the K 
score is low, it indicates that the testee was overly self-critical and obtained spuriously 
abnormal scores, hence is probably less deviate than his profile suggests. If K is high, 
it indicates a defensive tendency, and suggests that the profile is too low—a more subtle 
form of the old L scale. For further discussion see (5). 

Application of these criteria to 294 profiles from our general population 
male sample² yields 10% “invalid” records on the basis of ?, L, and F 
scores set as above. Of the records which can be accepted as valid, there 
are 9% indicative of abnormality by the criteria, which may be considered 
the upper limit of “false positives.” Actually, of course, an unknown 
proportion of these false positives are profiles of persons who, 
although not under psychiatric care at the time of testing, were at least 
as psychiatrically deviant as some of the hospitalized abnormals. The 
figure 9% is to be contrasted with the 3% to 5% found previously for 
single scales. It is to be kept in mind in what follows that the differentia- 
tions achieved among the hospitalized abnormals occur at the expense of 
almost 1 in 10 among the normal population. The remainder of this 
paper deals only with the differentiation among actual abnormals. 

When a profile had been classified as abnormal by the above criteria, 
a quick inspectional classification was made using three categories. The 
three employed were psychosis, psychoneurosis, and “conduct disorder.” 
The last category is used to cover cases diagnosed constitutional psycho- 
pathic inferior, psychopathic personality, criminalism, alcoholism, ex- 
cept psychoses or deterioration, simple adult maladjustment, or “pri- 
mary behavior disorder” such as the adolescent conduct problems not 
otherwise classified. The criteria employed in this subdivision of ab- 
normal records were intentionally vague and subjective, since it was this 
sort of inspectional judgment which was to be evaluated. No “computa- 
tions” of any sort were performed on the scores. In general, the criteria 
insofar as they were explicit, were those described by Schmidt and Gough, 
and such personal impressions as the examiner had acquired from con- 
siderable clinical work with MMPI. Psychosis was suggested by 
markedly elevated profiles, high F, Sc greater than Pt, Pa or Ma markedly 
elevated, the “psychotic” (right-hand) end of the curve reaching the 
level of the “neurotic” (left-hand) end, or a distinct spike on D, with the 
Hs and Hy scores on either side falling far below the D. Psychoneurosis 
was suggested by a less elevated profile, lower F, Pt greater than Sc, Pa 
and Ma not much elevated, the neurotic triad clearly elevated more than 
the rest of the curve, and the three scores of the triad closer to one an- 
other. Conduct disorder was suggested by elevations on Pd, Ma if not 
too high and especially with secondary peak at Pd, neurotic triad low 
except for some Hy, psychotic end running about 60. The examiner 
restricted himself to 10 seconds per profile in making his decision, and in 
most cases the judgment was made in less than five seconds. After having 
made the classifications, these were compared with the diagnoses of 
the psychiatric staff. All cases were eliminated in which the staff diagnosis 
was indicated as highly questionable or based upon insufficient study 
or cases of organic C.N.S. disease or feeblemindedness. The actual 
composition of the abnormal group as subsequently determined was 
as follows: Psychosis, 57 cases (Schizophrenia 26, Manic-depressive 21, 
Paranoid condition 8, and Involutional melancholia 2); Psychoneurosis, 
53 cases (Hypochondriasis 14, Hysteria 13, Reactive depression 9, 
Psychasthenia 7, Anxiety state 5, Mixed or unspecified 4, Neurasthenia 1); 
and Conduct Disorder, 37 cases (Psychopathic personality 21, 
Psychopathic personality, pathological sexuality 8, Alcoholic 5, 
Behavior disorder 2, Adult maladjustment 1). 

² The data are for males only. 

Of this entire group of 147 clinical abnormals, 25 (17%) invalidated 
their records on the basis of the validity indicators. Seventy-eight (53%) 
were correctly called abnormal, while the remaining 44 cases (30%) were 
(erroneously) classified as normal. The following table represents the 
data in various convenient breakdowns: 


Table 1 

Classification by Profile Inspection of 147 Records of Hospitalized Abnormals (Male) 

A. Percentages based upon all 147 records 

                               Called      Called     Invalid 
                               Abnormal    Normal     Record 

Total group (N = 147)            53%         30%        17% 
Psychotics (N = 57)              60%         21%        19% 
Neurotics (N = 53)               47%         36%        17% 
Conduct disorder (N = 37)        51%         35%        14% 

B. Percentages based upon the 122 valid records (based on ?, L, F scores) 

                               Called      Called 
                               Abnormal    Normal 

Total group (N = 122)            64%         36% 
Psychotics (N = 46)              74%         26% 
Neurotics (N = 44)               57%         43% 
Conduct disorder (N = 32)        59%         41% 

C. Percentages based upon the 78 cases called abnormal 

                                                      Called 
                               Called      Called     Conduct 
                               Psychotic   Neurotic   Disorder 

Psychotics (N = 34)              56%         29%        15% 
Neurotics (N = 25)               24%         68%         8% 
Conduct disorders (N = 19)       16%         16%        68% 
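
As a check on the arithmetic of Table 1, the percentages of parts A and B can be recomputed from case counts. The counts below are not reported directly; they are inferred here from the rounded percentages, the group sizes, and the totals given in the text (78 called abnormal, 44 called normal, 25 invalid), so they should be read as an assumption.

```python
# (called abnormal, called normal, invalid record); counts inferred from Table 1.
counts = {
    "Psychotics": (34, 12, 11),
    "Neurotics": (25, 19, 9),
    "Conduct disorder": (19, 13, 5),
}


def pct(part, whole):
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


def panels(abn, norm, inv):
    """Return (N, part A percentages, valid N, part B percentages)."""
    n_all = abn + norm + inv
    n_valid = abn + norm
    part_a = (pct(abn, n_all), pct(norm, n_all), pct(inv, n_all))
    part_b = (pct(abn, n_valid), pct(norm, n_valid))
    return n_all, part_a, n_valid, part_b


totals = tuple(sum(col) for col in zip(*counts.values()))

for group, row in counts.items():
    print(group, panels(*row))
print("Total group", panels(*totals))
```

With these counts every percentage in parts A and B is reproduced, and the row sizes agree with the N given for each group.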

From these tables, we see that in employing a criterion of abnormality 
which holds our false positives down to one in ten among general normals, 
we are able to detect only about half of the known abnormals (Table 1, 
A). This figure is not quite fair to the test, however, since those with 
invalid records would not under these conditions be erroneously classified 
as normal, but would either be requested to take the test over again with 
more precautions or their profiles disregarded. If we confine our atten- 
tion to records in which the validity indicators are satisfactory, we find 
that about two-thirds of the abnormals can be identified (Table 1, B). 
It should be pointed out that the disappointingly large proportion of 
apparently invalid records among the abnormals (about one-sixth of all 
the records) is in part due to the time at which these tests were adminis- 
tered. At that time patients were allowed to invalidate their testings 
by sorting large numbers of the cards into the “Cannot say” category, 
sorting the cards at random, and so on. More systematic supervision 
now eliminates many of these uninterpretable profiles. 

Setting up a contingency table for the 78 cases correctly classed as 
abnormal, we obtain a chi-square of 34.016, which with 4 d.f. is highly 
significant (P < .001). This corresponds to a contingency coefficient of 
.55, with the upper limit possible for a 3 × 3 table being .82. 
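
The chi-square and contingency coefficient can be approximately reproduced with a brief sketch. The cell counts are an assumption: they are the only whole numbers consistent with the rounded percentages and group sizes of Table 1, C, and are not printed in the paper. The formulas used are the ordinary Pearson chi-square, C = sqrt(chi-square / (chi-square + N)), and the upper limit sqrt((k - 1) / k) for a k × k table.

```python
import math


def chi_square(table):
    """Pearson chi-square for a table given as a list of rows of counts."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df, n


# Rows: actual psychotics, neurotics, conduct disorders.
# Columns: called psychotic, called neurotic, called conduct disorder.
observed = [
    [19, 10, 5],
    [6, 17, 2],
    [3, 3, 13],
]

stat, df, n = chi_square(observed)
c = math.sqrt(stat / (stat + n))   # contingency coefficient
c_max = math.sqrt((3 - 1) / 3)     # upper limit for a 3 x 3 table
hits = sum(observed[i][i] for i in range(3)) / n

print(round(stat, 2), df, round(c, 2), round(c_max, 2), round(hits, 2))
```

The statistic obtained from these inferred counts agrees with the reported 34.016 to within about .01, a difference of the size to be expected from rounding in hand computation. The diagonal of the same table gives the proportion of correct placements, roughly two in three.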

In comparing the accuracy of identification for the three diagnostic 
groups, we shall consider only the valid testings, since the percentages of 
invalid records differ insignificantly among the three. While three- 
fourths of the psychotics were identified as being abnormal as contrasted 
with between one-half and three-fifths of the neurotics, a test of signifi- 
cance in proportion of false “normals” in the three diagnostic categories 
fails to show a significant difference (Chi-square 3.233, 2 d.f., P > .14). 
This being the case, most of the further subgroup differences in identifica- 
tion were not statistically analyzed. Mere inspection of Table 1, C, 
however, would suggest that the chief confusion occurs between neurotic 
and psychotic curves, rather than between either of these and the class of 
conduct disorders. Once having correctly classed a profile as being ab- 
normal, the probability of its being thrown into the appropriate one of the 
three categories is about two in three. 

Detailed inspection of the table of actual-real classifications does not 
indicate much because of the small numbers of cases in various sub- 
categories. In the case of the psychoneuroses, however, inspection sug- 
gests that some clinical subgroups are more likely to show apparently 
“normal” profiles than are others. The differences in proportion called 
abnormal were tested by grouping the cases into four classes: hypochon- 
driasis, hysteria, psychasthenia, and all others, and running a chi-square 
test on the resulting 4 × 2 table. This chi-square was barely significant 
at the 5% level (Chi-square 8.303, 3 d.f., P < .046). Inspecting the 
table for the source of the differences, we find that 11 of the 13 hypochon- 
driacs were identified as abnormal, as compared with only half of the ten 
hysterias, and only one of the six compulsives. It has been recognized 
for some time that the Pt scale is relatively ineffective clinically, and the 
use of K as a suppressor for Pt in this crude way tends to increase the false 
negatives by leading to an under-interpretation of profiles because K is 
highly correlated (negatively) with Pt. More detailed treatment of the 
actual subcategory tables is not warranted by the numbers involved. 
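
The 4 × 2 test on the neurotic subgroups can be sketched in the same way. The three named rows come from the text (11 of 13, 5 of 10, 1 of 6); the "all others" row is an assumption, obtained by subtraction from the 44 valid neurotic records of Table 1, B, of which 57%, or 25 cases, were called abnormal. The probability uses the closed-form upper tail of chi-square for exactly 3 d.f.

```python
import math


def chi_square(table):
    """Pearson chi-square for a table given as a list of rows of counts."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    stat = sum(
        (obs - row_tot[i] * col_tot[j] / n) ** 2 / (row_tot[i] * col_tot[j] / n)
        for i, row in enumerate(table)
        for j, obs in enumerate(row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


def p_value_3df(x):
    """Upper-tail probability of chi-square with exactly 3 d.f."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


VALID_NEUROTICS = 44   # Table 1, B
CALLED_ABNORMAL = 25   # 57% of 44 (inferred count)

# (called abnormal, valid records) for the subgroups named in the text.
named = {"hypochondriasis": (11, 13), "hysteria": (5, 10), "psychasthenia": (1, 6)}

others_n = VALID_NEUROTICS - sum(n for _, n in named.values())
others_abn = CALLED_ABNORMAL - sum(a for a, _ in named.values())

# Columns: called abnormal, called normal.
table = [[a, n - a] for a, n in named.values()]
table.append([others_abn, others_n - others_abn])

stat, df = chi_square(table)
p = p_value_3df(stat)
print(table[-1], round(stat, 3), df, round(p, 3))
```

The statistic from these counts falls within .01 of the reported 8.303, and the corresponding probability is below the .046 bound given in the text.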


Summary and Conclusions 


The adequacy of the MMPI in differential diagnosis employing a 
rapid, inspectional method of pattern analysis of profiles was investigated 
by making “blind” diagnoses from records of 147 hospitalized psychiatric 
cases into three major categories of psychosis, psychoneurosis, and con- 
duct disorder. The criterion was the clinical diagnosis of the psychiatric 
staff, made at a time when the present scales of MMPI, with one excep- 
tion, were not yet in existence. The findings were as follows: 


1. Setting up arbitrary criteria for the overall distinguishing of 
normal from abnormal persons, we find that about 1 in 10 persons from 
the general population sample is called abnormal (false positive). 

2. Approximately 2/3rds of actual abnormals are identified as such by 
these criteria, if we exclude records obviously invalid on the basis of the 
validity indicators ?, L, and F. 

3. Of the abnormal cases identified as abnormal, about 2/3rds are 
placed in the appropriate category of the three employed. The contin- 
gency coefficient for the agreement between blind diagnostic grouping 
and the actual diagnosis is .55. 

4. There is a suggestion that some varieties of abnormality are more 
readily identified than others. Hypochondriasis is fairly easily identified, 
whereas hysteria and psychasthenia are less so. 


In general, while the discriminations achieved are very much better 
than chance in the statistical sense, especially considering the fact that no 
skilled clinical time is involved in giving or scoring the test and less than 
10 seconds was used here in “interpreting” it, it must be admitted that 
the proportion of false classifications is considerable. Two developments 
can be expected to reduce materially this margin of error: first, the more 
mathematically precise utilization of the suppressor K; second, the greater 
formalization of pattern interpretation. 


Received July 6, 1948. 


The K Factor as a Suppressor Variable in the Minnesota 
Multiphasic Personality Inventory * † 


Paul E. Meehl and Starke R. Hathaway 
Division of Psychiatry and the Department of Psychology, University of Minnesota 


I. History and Problem 


Among the very large number of structured personality inventories 
which have been published, it is by now quite generally admitted that 
there are relatively few which are of practical value in the clinical situ- 
ation. There are a number of reasons, both obvious and subtle, for this 
fact, some of which will be developed by implication in the present paper. 
One of the most important failings of almost all structured personality 
tests is their susceptibility to “faking” or “lying” in one way or another, 
as well as their even greater susceptibility to unconscious self-deception 
and role-playing on the part of individuals who may be consciously quite 
honest and sincere in their responses. The possibility of such factors 
having an invalidating effect upon the scores obtained has been mentioned 
by many writers, including Adams (1), Allport (2) (3) (4), Bernreuter 
(7) (8) (9), Bills (10), Bordin (11), Eisenberg and Wesman (15), Guilford 
and Guilford (18), Humm and Humm (31), Humm and Wadsworth (29), 
Kelly, Miles and Terman (32), Laird (33), Landis and Katz (34), Maller 
(39), Olson (51), Rosenzweig (53) (54), Ruch (55), Strong (58), Symonds 
(59), Vernon (62), Washburne (63), Willoughby (66) and others. One of 
the assumed advantages of the projective methods is that they are re- 
latively less influenced by such distorting factors, although this assump- 
tion should be critically evaluated. 

The existence of a distorting influence in test taking attitude is so 
obvious that it has hardly been thought necessary to establish it experi- 
mentally, although a number of investigations have demonstrated the 
effect. Frenkel-Brunswik (16) investigated tendencies to self-deception 
in rating oneself, finding in some cases marked negative relations between 
self-judgments and the evaluation of others. Hendrickson (27), cited by 
Olson (51), reported that a group of teachers earned significantly more 
stable, dominant, extroverted and self-sufficient scores on the Bernreuter 
scales when instructed to take the test as though they were applying for a 
position, than when under more neutral instructions. Ruch (55) showed 
that college students could fake extroversion on the Bernreuter to the 
extent of achieving a median at the 98th percentile of Bernreuter’s norms, 
as contrasted with a “naive” median at the 50th percentile. Bernreuter 
(8) found that college students could produce marked shifts in their 
Bernreuter scores in the “socially approved” direction, although he interpreted 
this finding as indicating the comparative unimportance of the 
faking tendency. His reasoning was that had the need for giving socially 
approved responses operated in the first administration to any appreci- 
able extent, the effect of special instructions to take this attitude should 
not have been great. This reasoning seems rather tenuous, inasmuch as 
the occurrence of a shift merely shows that conscious and permitted 
faking can produce greater effects than those which may have been 
operating in the “naive” original testing. The insignificant correlations 
between naive and faked scores were also used by Bernreuter to support 
his view, an argument which is not comprehensible to the present writers, 
especially in view of the probably gross skewness of the faked scores. 
What is clear from his investigation is that people are able to influence 
their scores to a considerable extent if they choose to, and that the aver- 
age student’s stereotype of what is “socially desirable” seems to be an 
individual who is dominant, self-sufficient, and stable. Maller (39), 
Metfessel (49), Olson (51) and Spencer (56) have studied the effects of 
anonymity on responses to self-rating situations and shown that the re- 
quirement of signing one’s name has a definite effect on the scores. 
Kelly, Miles and Terman (32) demonstrated the great ease with which 
scores on the Terman-Miles Masculinity-Femininity Test could be “faked” 
in either direction once the subjects had been let in on the secret of what 
the test measured. Strong (58), Bills (10), Steinmetz (57), and Bordin 
(11) have presented evidence of the ability of subjects to distort their 
interest patterns when taking the Strong Vocational Interest Blank. 

It is a significant sociological fact about the psychologists that in 
spite of the strong reasons, both a priori and experimental, for accepting 
the reality of this phenomenon in objective personality testing, very few 
systematic efforts have been made to correct for it or to overcome it. In 
published articles one continually finds brief and inadequate references 
to the “assumption of frankness” and the necessity for arousing a “sincere 
desire to know oneself better,” but the treatment is usually extremely 
sketchy and no very concrete suggestions are given for producing such 
test-taking attitudes nor, what is almost as important in practice, for 
determining the extent to which they have been present. It almost seems 
as though we inventory-makers were afraid to say too much about the 
problem because we had no effective solution for it, but it was too obvious 
a fact to be ignored so it was met by a polite nod. Meanwhile the scores 
obtained are subjected to varied and “precise” statistical manipulations 
which impel the student of behavior to wonder whether it is not the aim 
of the personality testers to get as far away from any unsanitary contact 
with the organism as possible. Part of this trend no doubt reflects the 
lack of clinical experience of some psychologists who concern themselves 
with personality testing, and the very strong contemporary trend which 
stresses the statistical interrelationships of item responses much more 
than the relation of the latter to external non-test criteria. The establishment 
of “validity” (sic!) in terms of various criteria of internal consistency 
naturally leads to an unconscious neglect of the problem of non-test 
behavior correlates. 

Among the many authors who recognize the problem there are a few 
who have made specific suggestions for its solution. The inclusion of 
special exhortations to frankness and objectivity in the test directions 
themselves is common, but we have no evidence as to its effectiveness. 
Obviously, if a subject is consciously determined to fake, he will do so; 
whereas if his motivation to distortion is of a more subtle, non-verbalized 
nature, such exhortations can hardly be expected to be efficacious. Another 
method is to attempt disguise of the items, so that the “significance” 
of a given response is less obvious. Traditional approaches to the 
measurement of personality render this technique practically impossible, 
inasmuch as the items are selected to begin with for their obvious psycho- 
logical significance and hence unless changed so greatly as to no longer 
elicit the desired information, almost inevitably continue to betray their 
origin. An effective use of a set of “subtle” items is only possible when 
the initial item pool is very large and the initial selection (not only the 
final validation) of items is ruthlessly empirical. Those items whose 
significance would not have been guessed by the test-maker will then be 
equally mysterious to the testee. When the projective and role-playing 
components of test-taking behavior are clearly seen to be present in 
objective personality inventories (46), this approach to the problem is 
very fruitful. A simple stratagem along the item-disguise line is to state 
about half of the items negatively, so that an affirmative response is not 
consistently a “bad” or maladjusted one. However, such techniques 
cannot eliminate the problem entirely. 

A spurious anonymity using secret coding for identifying the testee 
is a possibility suggested by the studies cited above, but is clinically 
impractical for obvious reasons. The deception involved is not desirable, 
and in any case the clinical patient, unlike the sophomore student, knows 
perfectly well that the examiner is interested in his score individually. 
Lacking anonymity, it has been suggested by Olson (51) that the name be 
signed at the conclusion of the administration instead of at the top of the 
page. This suggestion was carried into practice by Maller (40) in his 
Character sketches. This investigator also stated the questions in the 
“indirect” (third person) form, requiring the subject to indicate whether 
he was the same or different from the person described. Maller presents 
evidence that this procedure aroused considerably less annoyance in his 
subjects, although direct proof that this decrease in annoyance led to 
increased validity is lacking. For reasons which have been given in 
more detail elsewhere (46), it is doubtful whether the removal of personal 
reference is wholly desirable; since there is reason for believing that the 
same role-playings and self-deceptions which operate to invalidate some 
of our measurements are an important factor in making other measure- 
ments possible. 

Another technique for reducing the effect of signing one’s name is to 
have the items printed on cards which are then sorted by the subject, 
making all writing unnecessary and possibly lessening the feeling that one 
is making a permanent record of his personal failings. This has been 
done by Maller in a revised test (Personality Sketches) and by Hathaway 
and McKinley in the Minnesota Multiphasic Personality Inventory (26). 
The latter test will be referred to as MMPI. 

Although all of these stratagems may have a considerable value, 
especially in the aggregate, the fact still remains that they do not by any 
means remove the possibility of “faking.” What is much more import- 
ant, they are mainly directed at the sort of conscious falsehood which most 
writers have stressed, while ignoring the more subtle tendencies to self- 
deception which are probably of even greater importance in affecting 
scores. In the third place, they neglect to stress the existence of trends 
in the opposite direction—namely, those trends which exaggerate the 
apparent abnormality or maladjustment of the individual rather than 
soft-pedaling it. It is only natural that the tendency of a testee to put 
himself in a favorable light should have received more attention than the 
contrary tendency, which makes much less “sense” psychologically at 
least from a superficial point of view. There is evidence that this latter 
tendency does exist, however, and that it is a much more important 
factor in determining scores on personality inventories than has generally 
been supposed. Some of this evidence will be presented in the present 
paper, while other indications have been given elsewhere (47). It is also 
probable that certain systematic differences in item-interpretation, not 
necessarily a function of personality dynamics of the defensive or self-critical 
sort but relatively “neutral” psychologically (e.g. semantic variation), 
lead to score deviations that are misleading. Such problems have 
been investigated by Benton (6), Eisenberg (14), and Eisenberg and 
Wesman (15). 

A more fruitful attitude was taken by Rosenzweig (53) in which he 
reiterated the fact of untrustworthiness of self-ratings and indicated that 
instead of trying to completely eliminate these sources of error we should 
recognize them and attempt to “correct” for them in interpreting the 
results. He says, 


“Astute phraseology in the instructions and questions of the test have 
sometimes been resorted to, but such expedients are rarely very effective. 
Might it not be more effective to recognize at the outset that such tests have 
certain limitations that can never be completely circumvented and then go on 
to the measurement of these limiting factors themselves, thus obtaining information 
by which a correction may be applied to the subject’s answers?” (53). 


Rosenzweig’s specific proposal for achieving this end was to include 
among the usual self-rating items a set of items of the form “I should like 
to be the sort of man who . . .” on the theory that if we knew something 
of the strength of certain “ideal-self” trends in the person, we could make 
appropriate correction for these trends in interpreting responses to the 
traditional items. Rosenzweig never carried this idea into practice and 
there is no way of telling whether or not it would have worked. It seems 
to the writers that it would be relatively ineffective, since what is desired 
is not a statement of the strength or number of ideals for the self, but a 
measure of the extent to which they are allowed to distort responses. In 
other words, a subject might easily have quite lofty ideals verbally expressed, 
but might be too honest, insightful, objective, or self-critical to 
distort his responses into agreement with these ideals. It is, for example, 
rather characteristic of psychasthenic persons to express high and often 
unattainable ideals of perfection and achievement; whereas at the same 
time they are prone to be excessively self-critical, a fact which is psy- 
chometrically reflected in the negative correlation of the Pt (psychas-
thenia) scale of MMPI with some of the subtle “lie” scales which will be
discussed below. 

Maller (40) attempted to solve this problem in another way in his 
Character sketches, by including a small set of items which were supposed 
to measure the subject’s “readiness to confide.” The occurrence of very
normal, well-adjusted scores in combination with a low measured “readi-
ness to confide” would lead one to be sceptical of the validity of the
measurement. This was a material advance in principle, except that the 
“readiness to confide” items were themselves self-ratings on that very 
readiness. In the later form called Personality Sketches Maller does not 
make use of this procedure so we may assume that it was unsuccessful or
at least did not materially improve validity. 

Carrying Rosenzweig’s thinking to its logical conclusion, the obvious 
procedure to follow is to give the subject a good chance to distort his 
answers in accordance with some self-picture or conscious façade, and
observe the extent to which he does so. The difficulty here is that such a 
procedure requires a knowledge of the objective facts (and the subjective 
facts!) which is usually inaccessible to us. Here there are three possibi- 
lities open to the test-builder. First, he may sidestep the problem of 
getting directly at the objective truth, and attempt to establish falsehood 
by obtaining internal contradictions. This was another technique em- 
ployed by Maller in his earlier test. Cady (13), in his application of a 
modified form of the Woodworth Psychoneurotic Inventory to the meas- 
urement of juvenile incorrigibility, had earlier made use of repeated items 
to increase reliability of the scores; although the aim of detecting inconsis- 
tency of the “fake” sort was not explicit in his rationale. Each question 
appeared twice, once in each section of the test, except that in the second 
appearance the question was phrased in the negative. Theoretically the 
subject’s response should also be reversed; and the number of failures to 
reverse is an indication of some inconsistency and hence, Maller assumes,
of non-cooperation or dishonesty. The “inconsistency score” obtained in
this way was to be subtracted from the adjustment score to get a sort of
corrected score as proposed by Rosenzweig. It is by no means obvious 
that the shift to a negative form of item will leave the projective properties 
of the stimulus simply reversed in meaning; so that the fact of an “incon-
sistency” in the strict logical sense would not necessarily imply lack of
cooperation or dishonesty. However, it would seem reasonable that a 
very large number of such inconsistent pairs would cast grave suspicion 
upon the scores, either for dishonesty or some equally serious reason. 
This technique also was abandoned by Maller in his revised instrument. 

The second method of using distortion is to present opportunities for 
answering in a very favorable way but in a way which could almost 
certainly not be true. This idea was employed by Hartshorne and May 
in the Character Education Inquiry (23). Since there are very few 
aspects of behavior for which one could have complete confidence that no 
subject would be “ideal” in them, it is necessary to present a considerable
number of such opportunities and progressively reduce the probability 
that any flesh-and-blood individual would be as described. Everyone 
has at least a few highly desirable traits, and no one has all of them. 
Without knowing anything whatsoever about a particular person, we can
write down on common-sense grounds a list of extremely good and rare
human qualities which it is statistically absurd to suppose will all or in
large part be his. If he says, however, that he has all (or a very great
many) of them, we decide that he is not telling the truth. To practically 
clinch this argument it is only needful to choose desirable attributes 
which will very rarely belong, even singly, to anyone; and which further- 
more relatively few normal persons claim for themselves when given the 
chance. In the mass the answers to these items may yield very strong
evidence for deception. “I sometimes put off until tomorrow what I 
ought to do today” can be answered False by very few honest people. If 
a subject gives such responses with some considerable frequency, the 
inference is obvious. A more detailed discussion of this approach will be 
given in section III below. 
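
The statistical argument above can be made concrete with a small calculation. The following Python sketch assumes, purely for illustration, that each of 15 desirable attributes is truly possessed by 10 per cent of persons and that the attributes are independent of one another. Neither figure comes from the studies cited, and the function name is hypothetical. It computes the probability that an honest person could truthfully claim at least k of the attributes.

```python
def tail_probability(base_rates, k):
    """P(at least k attributes are genuinely true), assuming independence.

    base_rates: assumed probability that each attribute is truly
    possessed by a randomly chosen person (illustrative values only).
    """
    # dist[j] = probability that exactly j of the attributes seen so far are true
    dist = [1.0]
    for p in base_rates:
        nxt = [0.0] * (len(dist) + 1)
        for j, q in enumerate(dist):
            nxt[j] += q * (1.0 - p)   # this attribute is absent
            nxt[j + 1] += q * p       # this attribute is present
        dist = nxt
    return sum(dist[k:])


# Hypothetical base rates: 15 attributes, each true of 10 per cent of persons.
rates = [0.10] * 15
for k in (2, 6, 8):
    print(k, tail_probability(rates, k))
```

Under these assumed figures the probability falls off steeply as k grows, which is the sense in which a large number of such claims becomes "statistically absurd."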

The Humm-Wadsworth Temperament Scales and the Minnesota 
Multiphasic Personality Inventory have both made use of this method, 
the latter more explicitly. Humm and Wadsworth (29) deserve credit 
for having been among the first investigators of structured personality 
measurement to lay great stress upon the problem of detecting non-co- 
operation and distortion of response when evaluating a particular profile 
of scores. They were also among the first to adopt an explicit and un- 
compromising empiricism in selecting items from a large initial pool. 
The two scales which serve as “checks” or “correctors” for the remainder
of the profile on the Humm-Wadsworth are the “Normal” component 
and the “no-count.” The Normal component is rather difficult to 
evaluate from the theoretical point of view, for reasons which have been 
given elsewhere by one of the present writers (47). It is sufficient here to 
indicate merely its function as described by Humm and Wadsworth, 
which is to assess the strength of a general inhibiting, controlling, or 
normalizing factor in personality which serves to act as a “brake” upon 
strong abnormal tendencies on the other variables. This means that in 
interpreting a given profile, the significance of any deviation on one of 
the abnormal components must be established with the size of the Normal 
score in mind. To the extent that the Normal component measures what 
the authors claim for it, it is not especially relevant to the present prob- 
lem; but if it actually operates by detecting something other than the 
personality component they describe, it would perhaps be of significance 
here. For a more detailed discussion of this question the reader is re- 
ferred to the study cited above. 

The “no-count” is based upon the number of items to which the sub- 
ject responds in the negative. Inasmuch as approximately 76 per cent of 
the scored items (87 per cent of the total pool) of the Humm-Wadsworth 
are “obviously” suggestive of abnormality when replied to affirmatively, 
the “no-count” is to some extent a measure of the testee’s tendency to
avoid, consciously or otherwise, saying “bad” things about himself when
taking the test. That this relationship obtains is further supported by
the tendency for the no-count to correlate positively (.77) with the 
“Normal” component and negatively (—.39 to —.72) with the various 
abnormal components (29). If the no-count is excessively great, the 
inference is that the subject has responded in a very defensive or possibly 
(as in some psychotics) stereotyped fashion; and therefore the particular 
testing is of doubtful validity. In another article, Humm and Wads- 
worth state that as high as 25 or 30 per cent of normals seem to invalidate 
their scores in this way, a proportion which would seem to be impractically 
high for clinical purposes. In a later article (30) they attempt to reduce 
the proportion of useless tests by a “correction” for the no-count based 
upon multiple regression procedures. Humm and Wadsworth state 
that in a subsequent group of cases “well known” to them, the improved
validity of profiles thus corrected was demonstrated. An unpublished 
study of hospitalized psychiatric cases by Arnold (5) indicated that even 
the exclusion of cases with “invalid” no-count did not result in any 
greater validity clinically than was obtained using all cases. Humm 
(personal communication) states that improved multiple regression tech- 
niques have resulted in a very marked reduction in the proportion of test 
misses and of uninterpretable profiles. These more recent data on the 
Humm-Wadsworth have not been published. On present evidence it is
difficult to say to what extent the use of multiple regression technique was 
successful in improving validity. 

Washburne, in revising his “Test of Social Adjustment” (OSPA), in- 
cluded a set of 21 items modeled after the “lie” items of Hartshorne and 
May and referred to the total score on this set as objectivity. This score 
was included to detect both lying and unintentional inaccuracy, and the 
author reports that interviews with people showing very low objectivity 
scores showed that “it was useless to question them.” A very low ob-
jectivity score was said to invalidate the test as a whole, and a weighted
objectivity score was included in the total score on the entire test (63). 

Another application of the second method for detecting invalidity by 
identifying the presence of distortion was the “lie” scale (and its comple- 
ment, F) of the MMPI, which will be discussed in detail in section III 
below. 

The third technique available is the empirical derivation of a ‘fake’ 
scale by making use of the item shifts obtained when persons take a test 
under normal “naive” conditions and then are retested with instructions 
to fake. This method has been used by Ruch to construct an “honesty” 
key for the Bernreuter. It is interesting that a procedure so logical and 
straight-forward, invented to solve a problem so obvious and insistent, 
should have been employed for the first time over twenty years after the
appearance of the first personality inventory. Ruch says: 


“The argument is rather simple. If answers to items on a test like the 
Bernreuter can be faked at all, the chances are that some are easier to fake than 
others. Therefore, it should be possible to give each item a weight to repre- 
sent the extent to which it can be faked by the average college student. This 
was done by tabulating the frequency of each answer to each question for the
standard condition and for the influenced condition. These frequencies were
converted into percentages, and an ‘honesty’ weight was assigned to each reply 
according to the magnitude of the critical ratio of the difference between the 
frequency of the reply in the honest and in the influenced condition” (55). 


In applying this honesty scale to a new group he was able to show that 
all cases of “real” introverts would be detected in an attempt to make
themselves appear extroverted on the test. There are a number of inter- 
esting problems presented by this method, such as the extent to which 
the key would work if the subjects were not under actual instructions to 
fake extrovert but were being more “subtle” and actually trying to 
deceive an examiner in a real life situation. Presumably the deviation 
toward dishonesty would not be as great under such circumstances. The 
use of the critical ratio as a basis for weighting items might also be open 
to some question. In any event, Ruch seems to have been the first 
investigator to attempt empirical derivation of a fake key for a question- 
answer personality inventory. The results of applying this procedure 
to work on MMPI will follow in the present article. 
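
Ruch's weighting scheme, as quoted above, can be illustrated with a short Python sketch. It assumes that "critical ratio" means the difference between two independent proportions divided by the standard error of that difference, which was the usual meaning at the time. The quotation does not give Ruch's exact formula or his rule for turning ratios into weights, so the rounding and capping rule, the function names, and the counts below are all hypothetical.

```python
import math


def critical_ratio(true_honest, n_honest, true_faked, n_faked):
    """Difference between two proportions divided by its standard error."""
    p_honest = true_honest / n_honest
    p_faked = true_faked / n_faked
    se = math.sqrt(p_honest * (1 - p_honest) / n_honest
                   + p_faked * (1 - p_faked) / n_faked)
    diff = p_faked - p_honest
    if se == 0.0:
        # Both proportions are exactly 0 or 1, so the ratio is
        # either undefined (no difference) or unbounded.
        return 0.0 if diff == 0.0 else math.copysign(math.inf, diff)
    return diff / se


def honesty_weight(cr, cap=5):
    """Hypothetical weighting rule: cap the critical ratio, then round it."""
    return round(max(-cap, min(cap, cr)))


# Invented counts of "True" answers under the standard and influenced conditions.
counts = {
    "item_1": (20, 100, 60, 100),   # large shift under faking
    "item_2": (50, 100, 52, 100),   # almost no shift
}
for item, args in counts.items():
    cr = critical_ratio(*args)
    print(item, round(cr, 2), honesty_weight(cr))
```

An item whose answer frequency hardly moves between conditions receives a weight near zero, so only the readily faked items contribute to the key.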

As was mentioned earlier, there is some evidence of a tendency in the 
opposite direction in taking personality tests. It is difficult to character- 
ize such a tendency, especially since it may occur on several different 
bases. A patient in the hospital may for instance engage in a sort of 
“psychiatric malingering” for strictly conscious reasons, presenting a 
profile on a test such as MMPI which shows abnormalities out of all 
reasonable proportion to what is apparent from other considerations. 
Again, there may be somewhat general traits of verbal pessimism or self- 
deprecation which, while of some relevance personologically, act so as to 
systematically distort the results of personality measurement. We shall 
dichotomize the test-attitude continuum by the two opposed terms 
“defensiveness” and “plus-getting,” not implying anything as to the 
degree of conscious, deliberate deception involved in either. The cor- 
responding extremes, where such deliberate deception seems likely, we 
shall refer to as “faking good” and “faking bad” respectively. It is
recognized that, like the defensive tendency, the “plus-getting” tendency
may exist in all degrees from a mild self-criticality or merely objectivity 
to a deliberate, conscious attempt to make oneself look psychiatrically 
abnormal. Whether this represents simply the extreme of a continuum 
with faking good at the opposite end, or an entirely new and different
factor, we shall for the moment leave aside. In any case it would be 
desirable to develop a scale for detecting these tendencies to put oneself in 
a bad light when answering a personality inventory, so that allowance 
might be made in such cases in the light of a deviant score obtained on
such a scale. The F scale of MMPI was not originally developed with 
this in mind, but subsequent evidence showed that it could be used in this 
way (see below). Presumably the two “correction” scales C_H (42) and
C_D (25) which were found necessary in the early attempts to detect hypo-
chondriasis and symptomatic depression were at least partially dependent
upon the operation of such a plus-getting tendency. 

A systematic investigation of the plus-getting tendency was attempted 
by one of the writers, which resulted in the development of a somewhat 
more generalized correction scale which was called N. The details of 
derivation and interpretation of this scale are reported elsewhere (47) and 
will not be repeated here. Suffice it to say that from a study of the item 
responses made by a group of presumably normal persons who showed 
abnormal MMPI profiles as contrasted with a group of clinically ab-
normal persons with matched profiles, a group of items was isolated which
could be used to roughly quantify the plus-getting tendency. It was 
found that normal persons who show distinctly abnormal (maladjusted) 
profiles on the personality scales proper, tended to answer this selected 
set of N items in the “obviously” maladjusted direction, which was with 
few exceptions also the direction of response given by a minority of the 
unselected normal population. In other words, a person who is clinically 
normal in spite of having an abnormal profile shows a tendency to give 
statistically uncommon answers which are also “maladjusted” answers in 
the sense that by inspection they would be considered evidence of psy- 
chiatric involvement. For example, about 48 per cent of the unselected 
general population normals answer “True” to the item “A windstorm 
terrifies me.” Yet we find that among those normals selected specifically 
for showing apparently abnormal profiles on the personality scales proper, 
about 62 per cent give an affirmative answer to this question. Persons 
having MMPI profiles no more deviant than these plus-getting normals 
but who are actually abnormal clinically, give an affirmative answer about 
26 per cent of the time. Thus if a person shows an otherwise deviant 
profile but states that he is terrified by windstorms he stands a better 
chance of being clinically normal than one who gives the a priori more 
“normal” or “adjusted” response. Similar items on the N scale include 
such things as “I am afraid of fire,” “I have a fear of water,” “People
often disappoint me,” “I did not like school,” and so on. Inspection of 
these items and an examination of the correlations between N and the 
other MMPI scales led to a conviction that the N scale was actually
detecting a diffuse plus-getting tendency of the sort described. It was 
further shown that either the inspectional or mechanical use of the N 
scale in order to under-interpret profiles having the plus-getting tendency 
led to a reduction in the number of false positives in identification of psy- 
chiatric cases. However, the N scale was rather long, and was also ap- 
parently loaded with genuine psychiatric factors which led to an unde- 
sirable under-interpretation of profiles belonging to grossly abnormal 
persons. It is therefore to be seen merely as a beginning attempt which 
was supplanted by K as will be described below. 
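
The windstorm example above amounts to a likelihood-ratio argument, which the following Python sketch works through. The two answer rates (62 and 26 per cent) are the figures given in the text for persons with deviant profiles. The prior proportion of clinically normal persons among those with deviant profiles is not reported, so the value of one half used here is an assumption for illustration only, as is the function name.

```python
def posterior_normal(prior_normal, says_true,
                     p_true_normal=0.62, p_true_abnormal=0.26):
    """P(clinically normal | answer), for a person with a deviant profile.

    prior_normal: assumed proportion of deviant-profile persons who are
    clinically normal (hypothetical; not given in the text).
    """
    like_normal = p_true_normal if says_true else 1.0 - p_true_normal
    like_abnormal = p_true_abnormal if says_true else 1.0 - p_true_abnormal
    numerator = prior_normal * like_normal
    return numerator / (numerator + (1.0 - prior_normal) * like_abnormal)


# With an assumed prior of one half, a "True" answer raises the chance
# that the person is clinically normal, and a "False" answer lowers it.
print(posterior_normal(0.5, says_true=True))
print(posterior_normal(0.5, says_true=False))
```

A single item shifts the odds only modestly, which is why the N scale pooled many such items.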


II. MMPI Scale F 


The MMPI variables F and L were not formally validated originally, 
but were presented on face validity, that is, we assumed their validity on 
a priori grounds. The F variable was composed of 64 items that were 
selected primarily because they were answered with a relatively low 
frequency in either the true or false direction by the main normal group; 
the scored direction of response is the one which is rarely made by un- 
selected normals. Additionally, the items were chosen to include a 
variety of content so that it was unlikely that any particular pattern 
would cause an individual to answer many of the items in the unusual 
direction. A few examples are: “Everything tastes the same.” True. 
“I believe in law enforcement.” False. “I see things, animals, or people
around me that others do not see.” True. The relative success of this 
selection of items, with the deliberate intent of forcing the average num- 
ber of items answered in an unusual direction downward, is illustrated in 
the fact that the mean score on the 64 items runs between two and four 
points for all normal groups. The distribution curve is, of course, very 
skewed positively; and the higher scores approach half the number of 
items. In distributions of ordinary persons the frequency of scores drops 
very rapidly at about seven and is at the two or three per cent level by 
score twelve. Because of this quick cutting off of the curve the scores 
seven and twelve were arbitrarily assigned T-score values of 60 and 70 in
the original F table. 

From the first it was recognized that F represented several things. 
Most simply, since the subject would need to sort almost all of the items 
according to expectation in order for these low scores to result, any error 
in recording, such as mistaking true items for false items and the like, 
would raise the F score appreciably. Similarly, if a subject could not 
understand what he was reading adequately enough to make conventional 
answers to these items, the F score would obviously be higher. It was 
felt to be axiomatic that this method would eliminate as invalid records 
of subjects who could not read and comprehend or who refused to co-
operate sufficiently to make expected placements. 

In addition, however, it was early discovered that schizoid subjects 
and subjects who apparently wished to put themselves in a bad light also 
obtained high scores. The schizoid group obtained high scores because, 
due to delusional or other aberrant mental states, they said very unusual 
things in responding to the items and thus obtained high F scores. This 
is referred to as distortion since we feel that an impartial study would not 
justify the patient’s placements. Among more normal persons some high 
scores were also observed where the individual had rather unusual ways 
of responding to conventional stimuli such as are represented by the items 
involved. For example, to the item, “I have had periods in which I 
carried on activities without knowing later what I had been doing,” most
persons answered false. Some persons, however, included periods of 
sleep in the implication of the item. One might argue that such ways of 
thinking are often allied to schizoid mentation generally and that the 
answers in this case indicate a true abnormality. At the very least, 
however, the person is responding to some items in a way that differs from 
that of most individuals. Such persons might, therefore, not be ap- 
propriately approached through this method of personality measurement. 
It seems a reasonable enough possibility that there are individuals whose 
habitual ways of reacting to items are so different from their fellows that 
measurement of their personalities through the use of verbal items of this 
type would reflect the unusualness of their reactions to the items more 
than any clinical abnormality. This semantic factor has been treated 
more completely elsewhere (6) (14) (47). In so far as such a possibility
may exist we have not yet separated it from the clinically more important 
abnormality expressed in the Sc scale. Parenthetically, one of the most 
persistent difficulties with developing the Sc scale was this very fact, that an 
appreciable number of individuals obtained high scores on Sc without being 
marked by a clinically important degree of abnormality. They, neverthe- 
less, as indicated above, were responding differently from other people 
about them as represented by the original data from the general popula- 
tion. It appears that the essential difference clinically is concerned with 
the particular manifestation of unusual mentation in the individual. If 
this is not too clearly counter to society’s mores, the person may not be 
thought of as schizoid by those about him though he is often recognized 
as queer. 

Clinical experience suggests that the usual critical score of T = 70 is
too low in the case of F. We have found that scores ranging up to T = 
80 (raw score 16) are more often a reflection of “validly” unusual symp- 
toms and attitudes than an indication of invalidity in the rest of the 
profile due to misunderstanding, etc. Raw scores much above this,
however, strongly suggest an invalid record. 
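
The scoring logic described in this section can be sketched in Python as follows. The three keyed items are those quoted in the text, but the item labels are hypothetical and are not MMPI booklet numbers. The text states only that raw scores "much above" 16 suggest invalidity, so the three bands and the exact cut at 16 are a simplification made for this sketch.

```python
# Keyed (rarely given) direction for three F items quoted in the text.
# The dictionary keys are hypothetical labels, not booklet item numbers.
F_KEY = {
    "tastes_same": True,        # "Everything tastes the same."
    "law_enforcement": False,   # "I believe in law enforcement."
    "sees_things": True,        # "I see things ... that others do not see."
}


def f_raw_score(responses, key=F_KEY):
    """Count answers given in the keyed direction; unanswered items are skipped."""
    return sum(1 for item, keyed in key.items()
               if item in responses and responses[item] == keyed)


def interpret_f(raw):
    """Rough reading of a raw F score (bands are this sketch's assumption)."""
    if raw > 16:
        return "suspect: possible invalid record"
    if raw >= 12:
        return "elevated: more often validly unusual than invalid"
    return "within ordinary range"


answers = {"tastes_same": True, "law_enforcement": True, "sees_things": True}
print(f_raw_score(answers), interpret_f(f_raw_score(answers)))
```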

With the problems of measurement that developed in the armed 
forces where a subject might be expected frequently to attempt to put 
himself in a bad light in answering the MMPI, the F score became 
especially interesting. It was, of course, immediately possible to con- 
sider the F score as an evidence of this attempt to malinger and obtain 
fallaciously bad scores on other scales. As a check on this interpretation, 
a group of 54 ASTP men who had completed a considerable portion of 
their training in psychology were asked to take the Multiphasic twice.¹
The group took it in the usual way and also took it under instructions to 
assume that they wished to avoid being accepted in the draft and in order 
to be rejected they were to obtain adverse scores without giving them- 
selves away. Half the group took it with “fake” instructions first, half 
second. Through this plan we had a “faked bad” record and a “normal”
record. The data showed that 96 per cent of the “faked bad” records
had a raw score F of 15 or more (T > 78), indicating that even these men
who were somewhat cognizant of psychological measurements betrayed
themselves when they attempted to fake a bad record. The resulting 
profiles were nearly all unusually abnormal in character, and most of 
them would have appeared invalid from their pattern alone to one who 
has seen a large number of abnormal profiles. 

From this experiment it appeared that F was a very good device for 
identifying the intentional faking that could be set up in an experimental 
situation. It still seemed desirable, however, to attempt to separate 
among the individuals obtaining relatively high F scores those who were 
of the above-described schizoid type or who were simply over pessimistic 
in their view of themselves, from those who for one reason or another 
faked a bad score or did not understand the items. 


III. The L Scale 


The original 15 L items making up the L scale were selected under the 
inspiration of the work of Hartshorne and May. Each of the items pre- 
sents a situation desirable socially, but rarely true of the individual. For 
example in the item, “I read all the editorials in the newspaper every 
day,” we have a type of behavior that might be generally considered
desirable but which is true of only a small percentage of the population. 
It was recognized that very conscientious persons would frequently have 
more than average of these L items truly positive but for a person to have 
six or eight of them seemed almost impossibly good. The 15 items of
this type, scattered among the main body of the items, constituted a fairly
subtle trap for anyone who wanted to give an unusually good impression
of himself.


1 We are indebted to Dr. Howard F. Hunt for administration of these tests.


538 Paul E. Meehl and Starke R. Hathaway

Among the various normal groups the mean score on the L items lies 
between three and five. As in the case of F the frequency curves are all 
skewed sharply in the positive direction. Very few individuals obtain 
raw scores of seven or more, and the two or three per cent level is at about 
ten. These values were arbitrarily called the 60 and 70 T-score points, 
respectively. As the L score was used in the clinical setting and as some 
data began to accumulate from personnel workers in industrial situations, 
it became apparent that the assumptions regarding the meaning of L were 
in the main correct, but that there were also other valid interpretations of 
L, at least in the range from T-score 56 to 70. In fact we found ourselves 
placing considerable emphasis on T-scores of 56 to 60 which indicated 
that the original arbitrary assignment of T-scores had been too conserva- 
tive. On the other hand while the positive presence of the rise in the L 
score seemed quite valid as an indicator that the individual taking the 
test was being dishonest and might be somewhat unreliable, if no rise in 
L was observed, the finding could not be so positively and clearly inter- 
preted. The L score was a trap for the naive subject but easily avoided 
by more sophisticated subjects. 

To check the assumption that L would not identify the more sophisti- 
cated subject an experiment was performed with ASTP psychology stu- 
dents. As in the study cited under Section II above, 53 men were given 
the MMPI twice. The “faked good” data were obtained under the 
instruction to make certain in taking the test that they would be ac- 
ceptable to army induction. These records showed no appreciable rise 
in L. It is also true, however, that the majority of the profiles were only 
slightly, if any, better than the corresponding non-fake profiles. This 
experiment would have been improved if persons whose true profiles were
abnormal had been used. Some data have been collected from such cases 
but the number is small. At least, one may conclude that the intent to 
deceive is not often detectable by L when the subjects are relatively 
normal and sophisticated. 


IV. The K Scale 


In summary there were two basic lines of experimental approach to 
the problem of identifying the attitude a subject takes toward the items 
that he is faced with in the personality inventory.* Each of these two
approaches permits a subdivision into several methods. First, we may
have the subject deliberately assume a generally defined attitude, as in
the study by Ruch. For example, we may ask him to attempt deliber-
ately to obtain adverse scores while not betraying his intention.
Secondly, we may choose records in which there is presumptive likelihood
that a special attitude has been assumed. The first approach may be
subdivided into those experiments in which the “faking” is directed to-
ward obtaining adverse scores and the approach in which the intention is
to obtain desirable scores. In both latter cases an additional set of
responses must be obtained relatively simultaneously with the “faked”
responses in which the individual assumes his ordinary attitude. The
“faked” and “normal” records can then be contrasted for study. One
may then make an item analysis to discover the items that are most
frequently changed from the “normal” records as contrasted to the
“fake” records. Using these “fake” approaches, several scales were
derived.


* Harmon and Wiener (personal communication) have investigated the possibility
of detecting defensive and plus-getting tendencies through a division of certain MMPI
scales into “subtle” and “obvious” items. Separate T-scores may then be calculated
for the subtle and obvious scores on each scale so treated, and in terms of the discrep-
ancy between S and O one may form a judgment as to the strength of the defensive or
plus-getting test attitude of the subject. This ingenious technique is still in process of
investigation by its inventors and a more adequate treatment of the method and its
results will presumably be forthcoming from them later.


It was found that the items indicating an attempt to obtain a bad
record are not necessarily those derived by analysis of records where the
subjects attempted to obtain a good record. Our first finding in this
regard was that either of these procedures provided a scale that would be
about as good for the other type of faking as it was for the one from which
it was derived when such scales were applied to test cases not used in the
original derivations. It was further found that using two such scales
separately did not materially increase the predictive value. As has al-
ready been pointed out, it was also found that the original F scale was as
effective as was needed to identify those persons who intentionally at-
tempted to obtain a bad score at least within the range of the experiments
that we conducted. Conversely, the L scale was not effective nor were
any of the specially derived scales especially effective in identifying
sophisticated persons who deliberately attempted to obtain better scores.
In all of these experiments the findings were so complex and the time de-
voted to many subprojects was so great that we shall only present data
for the final scale K (see below).

In the second line of experimental approach there are also several sub-
divisions. One may find among presumably functional and normal re-
cords those records which are so abnormal as to indicate that the individ-
ual should have been in a hospital and attempt to discover the items
among these records that will differentiate them from the records of
actually abnormal persons. For the counterpart to this approach one 
chooses cases who were in the hospital but whose records show a normal 
profile. These may likewise be compared by item analysis to the records 
of hospital patients with suitably abnormal profiles who would be assumed 
to have had no interfering test taking attitude. Using this approach we 
also derived several scales and made many experimental tests of them. 
Again the details of all of these are not worthy of the complex presenta- 
tion they would require and these preliminary results will merely be sum- 
marized. 
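
The kind of item analysis referred to above, in which two groups of records are contrasted item by item, might be sketched in Python as follows. This is a generic illustration and not the authors' actual procedure: the selection threshold `min_diff`, the function names, and the toy records are all hypothetical.

```python
def endorsement_rate(records, item):
    """Proportion of records answering True to the item, among those answering it."""
    answered = [r[item] for r in records if item in r]
    return sum(answered) / len(answered) if answered else 0.0


def item_analysis(group_a, group_b, min_diff=0.25):
    """Select items whose "True" rate differs between the groups by min_diff or more.

    Returns a key mapping each selected item to the direction of response
    that is more characteristic of group_a.
    """
    items = set().union(*group_a, *group_b)
    key = {}
    for item in sorted(items):
        diff = endorsement_rate(group_a, item) - endorsement_rate(group_b, item)
        if abs(diff) >= min_diff:
            key[item] = diff > 0   # True: group_a answers "True" more often
    return key


def scale_score(record, key):
    """Number of selected items answered in the group_a direction."""
    return sum(1 for item, keyed in key.items() if record.get(item) == keyed)


# Toy records, e.g. patients with normal profiles (a) versus abnormal profiles (b).
group_a = [{"q1": True, "q2": False, "q3": True},
           {"q1": True, "q2": False, "q3": False}]
group_b = [{"q1": False, "q2": False, "q3": True},
           {"q1": False, "q2": True, "q3": False}]
key = item_analysis(group_a, group_b)
print(key, scale_score({"q1": True, "q2": False, "q3": True}, key))
```

As the text stresses, a key derived in this way has to be tried on cases not used in its derivation before its apparent discrimination can be trusted.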

The first and most important finding was that whichever of these 
methods was used, as was the case with the “faked” approach above, the 
resultant scales were about equally effective and about equally unsatis- 
factory regardless of the approach and of the particular item content. 
These scales were also rather effective in differentiating the “fake” group 
and in some cases were just as valid for that purpose as were the scales 
derived by that approach. After some two years of this experimentation 
all of the scales that had shown any promise were reconsidered by apply-
ing them to various available groups that had not been used in their deri-
vation and from among them all a single scale which was originally called
L6 was chosen as the best. It should be recognized that L6 was not 
entirely satisfactory but its action in several of the sample situations re-
sulted in its tentative adoption. As indicated in the above summary,
the particular derivation does not seem to play an important part, since
we could not easily distinguish a scale as having been derived by a
special process when we examined its action; nevertheless it may be
desirable to tell how L6 was derived. It must not be forgotten that
several other scales resulting from the other methods were very nearly as 
good as was L6, especially the plus-getting scale N. However, when the 
N scale and L6 were compared and even applied to the test situation set 
up for the N scale, L6 was a close competitor with N and in several in- 
stances was actually better. 

In brief, L6 was derived by an item analysis of the responses of 25 
males and 25 females in the psychopathic hospital whose profiles showed 
an L score of T = 60 or more and who, with the exception of six normal 
cases, had diagnoses indicating the probability that they should have had 
abnormal profiles but whose profiles were in reality within normal range. 
The diagnoses given to these cases by the psychiatric staff were mostly 
psychopathic personality, alcoholism and allied descriptive terms indicat- 
ing behavior disorders rather than neuroses. In general one would expect 
persons with such diagnoses to be rather more likely to be defensive in 
taking a personality test than cases of psychoneurosis. There are a few 
exceptions, however, in the case of hysteria where, as has been pointed out
in previous papers (44, 46, 47), there is a tendency for the hysteria to be
based upon something closely allied to the assuming of an overly perfect 
attitude in answering personality items. A particular listing of the 
diagnoses among these cases is not given here because the diagnostic 
categories are not clear enough to be of additional value. In summary, 
two criteria were employed in the selection of the criterion group. Practi- 
cally all of them were individuals known to be characterized by deviant 
behavior but they obtained relatively normal profiles and were thus what 
we have called misses for the Inventory; and all of these criterion cases 
were also characterized by having a tendency to obtain elevated scores on 
the original L scale.

The item responses of these fifty cases handled separately for males 
and females were compared to the male and female item frequencies from 
the general group of males and females that has been used in past scale 
derivations. In all, 22 items were chosen as a result of this comparison. 
All of these items showed a per cent difference of 30 or more between the 
criterion cases and the control group, males and females being considered 
separately. 
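
As an illustration only, the item-selection rule just described may be sketched in Python. The function name, the data layout, and the percentages shown are hypothetical; only the criterion of a 30-point difference comes from the text. The sketch further assumes that an item was retained only when the difference reached 30 points in the same direction for both sexes, which the text does not state explicitly.

```python
def select_items(criterion_pct, control_pct, threshold=30.0):
    """Return {item: keyed answer} for items meeting the threshold in both sexes.

    Both arguments map sex -> {item id: per cent answering "True"}.
    Requiring the same direction in both sexes is an assumption of this sketch.
    """
    selected = {}
    for item in criterion_pct["male"]:
        diffs = [criterion_pct[sex][item] - control_pct[sex][item]
                 for sex in ("male", "female")]
        if all(d >= threshold for d in diffs):
            selected[item] = "T"   # criterion group answers True more often
        elif all(d <= -threshold for d in diffs):
            selected[item] = "F"   # criterion group answers False more often
    return selected


# Hypothetical percentages, invented purely for illustration.
CRITERION = {
    "male":   {"item_a": 80.0, "item_b": 20.0, "item_c": 55.0},
    "female": {"item_a": 84.0, "item_b": 16.0, "item_c": 70.0},
}
CONTROL = {
    "male":   {"item_a": 45.0, "item_b": 60.0, "item_c": 50.0},
    "female": {"item_a": 50.0, "item_b": 52.0, "item_c": 35.0},
}

KEY = select_items(CRITERION, CONTROL)
```

With these made-up figures the first two items are retained and keyed T and F respectively, while the third is dropped because only the female difference reaches 30 points.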

These 22 items are given below with a T or an F to indicate the 
direction of the answer that would agree with the tendency of the criterion 
group. Since the criterion group is assumed to be a group with intent to
obtain good scores the larger raw scores on these items are in the same
direction as the larger raw scores on the original L scale. The content of 
these items would seem to suggest an attitude of denying worries, inferi- 
ority feelings, and psychiatrically unhealthy symptoms, together with a 
disposition to see only good in others as well as oneself. The over-all 
impression is one of “impunitiveness” if we may apply Rosenzweig’s 
term in a situation where no particular frustration is involved. 


B-55 I have very few quarrels with members of my family. (T) 

C-18 I worry over money and business. (F) 

C-27 It makes me impatient to have people ask my advice or other- 
wise interrupt me when I am working on something important. 
(F) 

C-33 It takes a lot of argument to convince most people of the truth.
(F) 

D-53 I think nearly anyone would tell a lie to keep out of trouble. 
(F) 

D-54 Most people will use somewhat unfair means to gain profit or 
an advantage rather than to lose it. (F) 

E-43 When in a group of people I have trouble thinking of the right 
things to talk about. (F) 
E-44 I find it hard to make talk when I meet new people. (F)

E-52 People often disappoint me. (F) 

F-8 It makes me uncomfortable to put on a stunt at a party even
when others are doing the same sort of things. (F) 

F-33 Often I can’t understand why I have been so cross and grouchy. 
(F) 

F-34 Criticism or scolding hurts me terribly. (F) 

F-43 At periods my mind seems to work more slowly than usual. (F) 

F-46 I frequently find myself worrying about something. (F) 

G-18 I have periods in which I feel unusually cheerful without any 
special reason. (F) 

G-29 I get mad easily and then get over it soon. (F) 

G-30 At times my thoughts have raced ahead faster than I could 
speak them. (F) 

G-31 At times I feel like smashing things. (F) 

I-22 I have often met people who were supposed to be experts who 
were no better than I. (F) 

I-31 I have sometimes felt that difficulties were piling up so high that 
I could not overcome them. (F) 

I-37 I certainly feel useless at times. (F) 

I-38 I often think “I wish I were a child again.” (F) 


Following the final choice of L6 as the best of the scales available, we
subjected it to more careful study and went back through hospital and
normal records to find out if it seemed to be of any help in interpreting 
individual profiles. There were relatively few data on normal cases but 
on hospital cases a fairly extensive symptomatic summary was available 
that would permit us to judge whether or not a patient should have had 
a normal profile. We could then look up the profile and if it was normal 
we could check to see if the L6 deviated in an upward direction indicating
that the patient had attempted to place himself in a good light. As a
result of this study L6 appeared effective but left much to be desired.

Since, in the summary of scales when L6 was chosen for intensive study,
it had seemed about as adequate for the detection of plus-getting as was
N or any of the other experimental scales, the records of a new series of
presumably normal persons showing deviant profiles were examined and
it was again true that L6 appeared to work at the plus-getting end of the
test-attitude continuum. That is to say, a relatively low score on L6
could be used to under-interpret an otherwise deviant profile and thus
avoid some of the presumably false positives in the normal population
sample. Thus L6 seemed useful at “both ends” of the test-attitude
continuum, defensiveness and plus-getting. 
The most outstanding difficulty in such a procedure was that L6
tended to be low on severe depressive or schizophrenic patient records 
and thus lead to an under-interpretation in spite of the fact that the 
patients were very grossly abnormal. To partly correct for this tendency, 
items were added that would work in the opposite direction. To choose 
these we studied the item tabulations for the group of ASTP men who had 
attempted to fake good and bad scores. In this study there were many 
items which showed no tendency to change with an alteration in the test- 
taking attitude. That is, the per cent of true or false, as the case might 
be, remained constant whether the attitude was the normal one or the 
faked one. From among these items, a sub-group was chosen which 
showed differences between schizophrenic and depressive criterion groups 
and general population normals. The procedure rested upon the ad-
mittedly somewhat shaky assumption that any item that did not appear
to be much affected by the test-taking attitude as approached by a normal 
person attempting consciously to “fake” good or bad but which did occur 
as a frequent item to differentiate depressed or schizophrenic patients 
would be useful in correcting the tendency of our L6 scale to go too low
for schizophrenic and depressed patients. Of course such an item was
scored in a way that would make it work against the tendency of the L6
scale. Eight items were selected by this method. The effect of adding
these eight items to the 22 on L6 was of course to elevate slightly the mean
score of normals and make it more nearly approach the mean score of
abnormal cases on the complex of all 30 items. The eight items chosen
by this procedure are given below. The letters T and F indicate the 
response scored in the “lie” direction, and in the direction characteristic 
of schizophrenic and depressed cases. 


A-3 I have never felt better in my life than I do now. (F) 

C-28 I find it hard to set aside a task that I have undertaken, even 
for a short time. (F) 

D-48 I think a great many people exaggerate their misfortunes in 
order to gain the sympathy and help of others. (F) 

D-51 I am against giving money to beggars. (F) 

F-7 What others think of me does not bother me. (F) 

F-20 I like to let people know where I stand on things. (F) 

G-23 At times I am all full of energy. (F) 

J-51 At times I feel like swearing. (F) 


As a final step these eight items were combined with the 22 L6 items
into a single scale which we have called K. The K scale represents the 
final outcome of many experiments in the general field of measuring test 
attitude. The K scale is far from perfect for its purpose as measured by
the various available data. Generally speaking it is about as good as 
any other single scale derived for any one of the single purposes that have 
been described. In individual applications it is inferior now to one scale 
and now to another but the differences are never great enough to be very 
significant practically and the small number of items in this scale gives it a 
distinct advantage over one or two of the longer scales such as N. Fi- 
nally, as was stated above it is not expedient to present more than a single 
scale although a slight advantage could have been gained if two scales 
analogous to the original L and F scales had been separately presented. 
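
For readers who wish to see the scoring made concrete, a raw score on a keyed scale of this kind is simply the number of items answered in the keyed direction. The following Python sketch is illustrative only: the key shown is an excerpt of five of the 30 items listed in the text, the function name and the example answer sheet are hypothetical, and treating an unanswered item as unscored is an assumption of the sketch.

```python
# Excerpt of the K key, taken from the item lists in the text (5 of 30 items).
K_KEY_EXCERPT = {
    "B-55": "T",
    "C-18": "F",
    "E-52": "F",
    "A-3": "F",
    "J-51": "F",
}


def raw_score(responses, key):
    """Count the items answered in the keyed direction.

    Items missing from `responses` add nothing (an assumption of this sketch).
    """
    return sum(1 for item, keyed in key.items() if responses.get(item) == keyed)


# Hypothetical answer sheet; item J-51 has been left unanswered.
example_sheet = {"B-55": "T", "C-18": "T", "E-52": "F", "A-3": "F"}
score = raw_score(example_sheet, K_KEY_EXCERPT)
```

Here three of the five excerpted items are answered in the keyed direction, so the raw score on the excerpt is 3.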

The construction of K being what it was, odd-even or Kuder-Richard- 
son reliabilities were not computed. Test-retest coefficients were .72 
and .74 computed on two groups, one of which was retested at intervals 
varying from one day to over a year, the other after a lapse of 4-15 months. 

Since the K scale was derived as a correction scale or suppressor vari- 
able (28, 48) for improving the discrimination yielded on the already 
existent personality scales, it was not assumed to be measuring anything 
which in itself is of psychiatric significance. Actually, its relationship 
with such clinical variables as the subtle Hy items (see below) might 
suggest an interpretation of K alone; further, it is presumably a signifi- 
cant fact about a person that, in answering a personality inventory, he 
tends to behave as a “liar” or a “plus-getter.” However, the real func-
tion of K is intended to be the correction of the other scores; and validity
will be discussed with reference to this function only. 

It is first necessary to choose criterion cases of the sort on which K can 
conceivably be of value. It is clear that such cases will be characterized 
by the presence of what may be called borderline profiles, i.e., those show- 
ing T-scores, say, between 65 and 80. The reason for this is that in 
studying hundreds of deviant profiles after the addition of K, almost no 
individuals were found with T-scores above 80 in the normal sample, and 
it was not statistically profitable to correct elevations of such magnitude 
to the point of calling them normal. On the other hand, when a curve 
shows no elevations at all above 65, even the presence of a high K score 
does not enable the clinician to form any adequate notion of what the 
peak would be, if any, had the K-factor not been operating to distort the 
results. In other words, there are upper and lower limits beyond which 
deviations on K cannot effectively operate. Profiles showing scores above 
80 are to be interpreted as probably “abnormal” no matter how low K 
falls; while if a profile shows no scores above 65 we cannot tell whether a 
high K means the profile should be adjusted toward more severe scores 
or is merely that of an actually normal person who for some reason or 
other took a defensive attitude when being tested. The kind of curve 
which gives interpretative difficulty and which could conceivably be im-
proved by knowledge of the influence of K would be a curve in the doubt-
ful, borderline region. Accordingly, a group of cases from the normal and
hospital groups was chosen on the basis of having achieved such border-
line curves. We selected for this study all cases in the files showing at
least one personality component * elevated as high as T = 65, but no
component elevated to T > 80. Among the normals, there were 174
having such borderline curves, of which 71 were males and 103 were
females. Corresponding to these cases, we located among our clinically 
abnormal cases 129 males and 208 females with similar borderline pro- 
files. The data for the two sexes were treated separately. 
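
The selection rule for borderline curves can be stated compactly in code. The sketch below is a minimal illustration in Python under the limits given in the text (at least one component at T = 65 or above, none above T = 80, Mf excluded); the function name, the three band labels, and the example profiles are hypothetical.

```python
# The eight personality components considered; Mf is excluded, as in the text.
CLINICAL = ("Hs", "D", "Hy", "Pd", "Pa", "Pt", "Sc", "Ma")


def profile_band(t_scores, low=65, high=80):
    """Classify a profile by its highest clinical T-score.

    "above"      -> some component exceeds `high`
    "borderline" -> highest component lies between `low` and `high` inclusive
    "below"      -> no component reaches `low`
    """
    peak = max(t_scores[scale] for scale in CLINICAL)
    if peak > high:
        return "above"
    if peak >= low:
        return "borderline"
    return "below"


# Hypothetical profiles, invented for illustration.
borderline = {"Hs": 55, "D": 68, "Hy": 60, "Pd": 50, "Mf": 85,
              "Pa": 52, "Pt": 58, "Sc": 61, "Ma": 49}
elevated = dict(borderline, Sc=84)
flat = dict(borderline, D=60)
```

Only profiles falling in the "borderline" band would enter the analysis; note that the high Mf score in the first example does not affect its classification.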

The analysis of these data was in terms of the ability of the K scale, 
used mechanically as will be described, to separate the curves of the actual 
normals from those of the actual abnormals. For each sex group, the 
procedure was to arrange the whole set (normals and abnormals com- 
bined) in order of the magnitude of their K scores. The distribution of 
K was cut on the basis of the proportion of normals and abnormals in the 
sample, calling all cases above the cut “abnormal” and all those below
“normal.” Setting up a fourfold table on this basis, a chi-square of 
20.436 for the males and 29.540 for the females was obtained. Both of 
these are highly significant (P < .001) with 1 df. If, instead of locating 
an optimal cutting score the K distribution was cut at the mean of the 
general population K distribution (i.e., at T = 50 regardless of the pre- 
sent samples) the cutting point of the males is unchanged, whereas that 
for the females shifts enough to lower their chi-square to 17.750, which is 
still highly significant. In other words, if one considers miscellaneous 
profiles which lie in the borderline range between 65 and 80, regardless 
of the kind of elevation and irrespective of the clinical diagnosis of those 
who are clinically abnormal, he can separate them into “actual” normals
and abnormals significantly better than chance by using a cutting score 
on K. It must be emphasized again that K in this instance is operating 
chiefly as a suppressor of certain test-taking tendencies, since K by itself 
does not practically differentiate unselected normal and abnormal cases 
(1 to 2½ raw score points difference between means for various samples).
In terms of percentages, it was found that for the males, 72 per cent of the 
abnormals and 61 per cent of the actual normals were correctly identified.
For the females, 66 per cent of the abnormals were identified as such and 
59 per cent of the normals were so classified. These percentages are 
based upon the separations at a K = 50, taking, therefore, no account of 
the actual normal-abnormal proportions among the present cases. 
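
The fourfold analysis above can be retraced with a short Python sketch. The cell counts are not given in the text; here they are back-computed from the reported group sizes and percentages of correct identification at K = 50, rounded to the nearest case, and should be read as an assumption. The function names are hypothetical, and the chi-square is the ordinary uncorrected Pearson formula for a 2 x 2 table.

```python
def table_from_hit_rates(n_abnormal, n_normal, pct_abnormal_hit, pct_normal_hit):
    """Rebuild a fourfold table (a, b, c, d) from group sizes and hit percentages.

    a = abnormals called abnormal, b = abnormals called normal,
    c = normals called abnormal,   d = normals called normal.
    Rounding to whole cases is an assumption of this sketch.
    """
    a = round(n_abnormal * pct_abnormal_hit / 100)
    d = round(n_normal * pct_normal_hit / 100)
    return a, n_abnormal - a, n_normal - d, d


def fourfold_chi_square(a, b, c, d):
    """Pearson chi-square (1 d.f., no continuity correction) for a 2 x 2 table."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Group sizes and percentages as reported in the text.
MALES = table_from_hit_rates(129, 71, 72, 61)      # -> (93, 36, 28, 43)
FEMALES = table_from_hit_rates(208, 103, 66, 59)   # -> (137, 71, 42, 61)

chi_males = fourfold_chi_square(*MALES)
chi_females = fourfold_chi_square(*FEMALES)
```

Under these reconstructed counts the formula gives values that agree, to within rounding, with the reported chi-squares of 20.436 for the males and 17.750 for the females at the T = 50 cut.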


* Mf is excluded from consideration here and in all that follows. 
Evidence from examination of the test misses spotted by K in the
above data combined with our knowledge of the correlation between K 
and other MMPI scales, indicated that the K correction was more im- 
portant in the case of some scales than of others. Therefore, it was 
decided to analyze the borderline groups in terms of the peak elevation 
of their profiles, in the attempt to identify those particular curves on 
which K could be used with profit. 

The entire group of 511 borderline curves (males and females, normals 
and abnormals pooled) was divided into eight sub-groups, each sub-group 
being composed of cases having the peak score on the same one of the 
eight personality components. Thus, there were 60 curves having the 
peak on Hs, 91 on D, 119 on Hy, 66 on Pd, 38 on Pa, 25 on Pt, 28 on Sc, 
and 52 on Ma. (The difference between this total of 479 cases and the 
511 used in getting the over-all chi-square is due to the exclusion of 32 
profiles on which no “peak” could be fairly assigned, since two or more of
the components showed identical T-scores and these were the highest on 
the given curve.) 
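
The assignment of peaks, and the bookkeeping of the excluded cases, may be illustrated as follows. This Python sketch is a minimal illustration: the function name and the example profiles are hypothetical, while the sub-group counts are those reported in the text.

```python
CLINICAL = ("Hs", "D", "Hy", "Pd", "Pa", "Pt", "Sc", "Ma")


def peak_component(t_scores):
    """Return the component with the single highest T-score.

    Returns None when two or more components tie for the highest score,
    mirroring the exclusion of profiles with no assignable peak.
    """
    top = max(t_scores[scale] for scale in CLINICAL)
    leaders = [scale for scale in CLINICAL if t_scores[scale] == top]
    return leaders[0] if len(leaders) == 1 else None


# Sub-group sizes as reported in the text.
REPORTED_PEAKS = {"Hs": 60, "D": 91, "Hy": 119, "Pd": 66,
                  "Pa": 38, "Pt": 25, "Sc": 28, "Ma": 52}
assigned = sum(REPORTED_PEAKS.values())
unassigned = 511 - assigned

# Hypothetical profiles, invented for illustration.
clear_peak = {"Hs": 55, "D": 72, "Hy": 60, "Pd": 50,
              "Pa": 52, "Pt": 58, "Sc": 61, "Ma": 49}
tied_peak = dict(clear_peak, Sc=72)
```

The reported sub-groups sum to 479, leaving the 32 tied profiles mentioned in the text.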

The normals and abnormals having borderline curves with the same 
peak score were then separated mechanically by the use of a cutting score
on K, the proportion of cases above the cutting score being determined on 
the basis of the proportion of actual abnormals versus normals in each 
sub-group. This was unavoidable in the present analysis because the 
relative proportions of actual normals and abnormals varied widely from 
scale to scale and the use of the mean of K would have been grossly mis- 
leading since in some instances the proportions were extremely asym- 
metrical (67). For the eight groups studied in this manner, only three 
showed a significant chi-square (P < .01), namely those having peaks on 
Hs, Pd, and Sc. The Ma group yielded a chi-square between the 10 per 
cent and 20 per cent level of significance. On D, Hy, Pa and Pt the chi- 
squares were all below the 20 per cent level of significance; and the 
pooled chi-square for these five scales (5 d.f.) gave a P > .22. It would 
seem, therefore, that the K-factor may be used with profit in interpreting 
some kinds of profiles but not others. Of course, the failure to dis- 
criminate with K when grouping profiles by peak score does not establish 
that a K-correction might not be profitably added to the single scores 
themselves. This problem will be treated at length in a sequel to the 
present paper. 

One other validating study was done on K. In this instance, we made 
use of a group of 22 normals and 22 abnormals employed in a previous 
study (47). The normals in this set consisted of a random selection from a 
large group of profiles showing any elevation of 70 or over (excluding Mf). 
The abnormals consisted of a heterogeneous group also having at least one 
score over 70, and included seven psychoneurotics, seven schizophrenics,
three psychopaths, two alcoholics, two manic-depressive (depressed), and 
one paranoid state, chosen randomly from recent hospital cases. These 
groups had been selected for a different purpose and had not entered into 
the derivation of K in any way. They can also be considered, therefore, 
a fair test group for validation purposes. Without regard for any other 
information concerning the profiles, all cases showing K > 50 were arbi- 
trarily guessed as abnormals, whereas those with K < 50 were called 
normals. The cutting score was therefore also independent of the 
statistics of the present group. Here the K scale worked phenomenally 
well, being much better than the N-scale (which was derived on cases 
some of which were included in this blind diagnosis study). Of the en- 
tire group of 44 cases, 37 were correctly classified when using K in this 
way, a total of 85 per cent hits. It will be recalled that we are here try- 
ing to separate normals and abnormals all of whom have deviant profiles, 
so that this per cent is quite impressive considering the task set for K. 
Of the seven errors in classifying, six are “false positives,” i.e., cases of 
normals showing elevated profiles and K > 50, called therefore abnormal. 
The chi-square for the fourfold table of these data is 21.569 which with 1 
d.f. is highly significant (P < .001). This corresponds to a contingency 
coefficient of .57. Here we have striking evidence of the validity of K 
when used to differentiate between deviant curves of actual normals and 
abnormals. We are not prepared to explain the superiority of this re- 
sult to that given by the analysis previously discussed, except to say that 
the range of abnormal scores in the present analysis was from 70 to 90 
whereas in the previous analysis we used “borderline” scores defined as
lying between 65 and 80. In what way this could make K appear to 
function more effectively in the one case than the other is not clear. Also 
the present study involved only males, where K in general seems to work 
a little better than on females. 
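
The figures of this blind study can be checked directly, since the fourfold table follows from the counts given: 22 normals and 22 abnormals, seven errors, six of them false positives, hence one abnormal called normal. The Python sketch below is illustrative; the function names are hypothetical, the chi-square is the uncorrected Pearson formula, and the tail probability uses the standard relation between chi-square with 1 d.f. and the normal distribution.

```python
import math


def fourfold_chi_square(a, b, c, d):
    """Pearson chi-square (1 d.f., no continuity correction) for a 2 x 2 table."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def contingency_coefficient(chi_square, n):
    """Pearson's contingency coefficient C = sqrt(chi2 / (chi2 + N))."""
    return math.sqrt(chi_square / (chi_square + n))


def p_value_1df(chi_square):
    """Upper-tail probability of chi-square with 1 d.f.: P = erfc(sqrt(chi2 / 2))."""
    return math.erfc(math.sqrt(chi_square / 2.0))


# Rows: actual abnormal, actual normal. Columns: K > 50, K < 50.
abnormal_hi, abnormal_lo = 21, 1   # one abnormal called normal
normal_hi, normal_lo = 6, 16       # six false positives

chi2 = fourfold_chi_square(abnormal_hi, abnormal_lo, normal_hi, normal_lo)
c_coef = contingency_coefficient(chi2, 44)
p = p_value_1df(chi2)
```

The computed chi-square and contingency coefficient agree with the reported 21.569 and .57, and the tail probability falls well below .001.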

The fact that K is less effective as applied to some scales than others
would suggest separate interpretations or cutting scores depending upon 
the kind of profile with which one is confronted. Furthermore, the rough 
classification into “normal” and “abnormal” on the basis of a single arbi-
trary cutting score obviously sacrifices some quantitative information
about the actual magnitude of the personality scale elevations with re- 
spect to the magnitude of the K score. We do not intend to propose such 
a rough cutting method as the most efficient manner of application for 
K, but are using that form here simply to indicate that K has differenti-
ating power for what it was hoped to differentiate. The optimal mathe-
matical procedure in using K as a suppressor involves complex issues
which we shall have to reserve for a later publication. 

V. Relation of K to Other Test Variables


The correlation of the K scale with other MMPI variables should 
throw some light upon the question of its differential efficiency on these 
scales, as well as give us some insight into its psychological nature. 
Table 1 below shows the intercorrelations of K with the other personality 
components measured by MMPI. These correlations are based upon 
100 cases in each of the four groups indicated, chronological ages 26-45, 
excluding records having “?” > 70 or F > 80.


Table 1
Intercorrelations of K with Other MMPI Variables

                     Hs     D    Hy    Pd    Pa    Pt    Sc    Ma    Mf

Normal males       -.30   .15   .48  -.17  -.07  -.67  -.59  -.36
Normal females     -.35  -.03   .32  -.06  -.02  -.64  -.58  -.28
Male abnormals     -.42  -.29   .11  -.26  -.19  -.60  -.60  -.37  -.08
Female abnormals   -.17  -.16   .17  -.21  -.13  -.63  -.58  -.38   .04





Of interest in this table are the following facts. With the exception 
of Hy and one of the four coefficients of D, the correlations are consist-
ently negative. This is of course to be expected if K represents the de-
fensive, lying, or self-deceptive test-taking attitude it was derived to
measure. The negative correlations with Hs combined with the positive 
correlation with Hy indicate that there must be a fairly high positive 
correlation between K and those non-somatic items on Hy which have 
been previously referred to—the “zero” items on Hy or what Harmon 
and Wiener have called “hy-subtle” (henceforth designated Hy-O).*
Since this latter set of items, although derived by its empirical separation
of clinical hysterias from normals, seems to reflect the self-deceptive and
impunitive attitude of the hysterical temperament, it is consonant with 
our interpretation of K that it should be markedly correlated with Hy-O. 
The direct evidence on this point will be reported below. The only cor- 
relations of very impressive magnitude which appear in this table are 
those with Pt and Sc. Here they are high negative—the person who 
makes responses characteristic of compulsive and schizoid persons has the 
opposite of the self-deceptive and defensive attitude. In other words, he 
tends to be a “plus-getter” and in this way is distinctly unlike the hysteric. 


* These items are called “zero” items because on the scoring templates they are indi- 
cated with a letter “O,” meaning that one receives a point for the “abnormality” by 
responding in the direction which, on that single item, characterizes the majority of 
general normals. This means that the abnormals in question tend to give the “normal” 
response much more often than the normals do. 
These correlations are also in harmony with our clinical knowledge of the
components in question, especially in the case of the psychasthenia. The 
Pt scale has never been considered very satisfactory, and it has been 
shown in unpublished studies that Pt can actually be used as a correction 
scale in the way in which N was used. It is perhaps significant that of all 
the MMPI scales, Pt is the only one for which, lacking a sufficiently large 
criterion group, methods of internal consistency were employed in the 
item selection. Here again we would expect to get a greater operation of 
non-clinical test-taking factors of the K variety. 

It might be thought that such low correlations as occur in the table 
above would preclude any possibility of the use of K as a suppressor. 
There is a tendency for the scales on which K seems “valid” by the chi-
square test to show the higher correlations, with the exception of Pt. It
will be shown in a subsequent paper that, for the use to which K is put, 
correlations as low as .20 can be utilized to yield very significant and use- 
ful improvements in discrimination. 

At this point we may briefly review some of the previously developed 
scales which are now known to be saturated with what we may call the 
K-factor, since their diverse sources and methods of derivation furnish 
additional strong evidence for our theoretical interpretation of K. Two 
of these scales have never been published, so that their derivation and 
properties must be briefly summarized here. About three years before 
research on the test-taking attitude was begun, Hathaway and W. K. 
Estes, using a variant of the method of internal consistency, developed a 
scale called G. This scale is the only MMPI scale which was derived 
without the use of any kind of criterion external to the test; like those 
personality tests being developed by factor analytic methods at the 
present time, the selection and scoring of items was based wholly upon 
the intercorrelations among the items themselves. Essentially, the pro- 
cedure consisted in locating among a group of 101 unselected normals 
those individuals who, when their answer sheets were used as scoring 
keys, produced the maximum variance of the other 100 scores. The as- 
sumption was that these persons were the most extreme deviates on what- 
ever factor or factors contributed most heavily to the variance and co- 
variance of the total pool of MMPI items. From the evidence adduced 
by Mosier (50), it is of course clear that the “purity” or factorial unity of
this hypothetical underlying continuum is by no means guaranteed by 
such a procedure. Another way of looking at this procedure is to con- 
sider the fact that one maximizes the variance of a set of items by scoring 
them in such a direction as to maximize their mean covariance—since the 
item variances are unaffected by the direction of scoring. Instead of 
actually calculating the variances for the 2^n ways of scoring the test, we
select individuals who approximate the optimal scoring key. It was
found that the scoring keys for some 10 individuals selected by this 
method tended to form two distinct clusters, each of which consisted of 
keys (individuals) showing high correlations with one another and high 
negative correlations with the members of the other cluster. An item 
analysis was then carried out on these two small groups, and the items 
resulting were combined into a scale called G (general factor). 
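
The logic of using a respondent's own answer sheet as a scoring key can be illustrated on a toy scale small enough for all 2^n keys to be tried. The Python sketch below is not the procedure as actually carried out for G; the function names and the six four-item answer sheets are hypothetical, and the population variance is used for simplicity.

```python
from itertools import product
from statistics import pvariance


def scores(responses, key):
    """Score each respondent as the number of items answered as in `key`."""
    return [sum(r == k for r, k in zip(row, key)) for row in responses]


def key_variance(responses, key):
    """Variance of total scores when the test is scored with `key`."""
    return pvariance(scores(responses, key))


def best_key_exhaustive(responses):
    """Try all 2^n scoring directions and return the key of maximum variance."""
    n_items = len(responses[0])
    return max(product((0, 1), repeat=n_items),
               key=lambda k: key_variance(responses, k))


def best_respondent_key(responses):
    """Use each answer sheet as a key for the remaining respondents.

    Returns (index of respondent, variance of the others' scores) for the
    sheet producing the largest variance.
    """
    best_index, best_var = None, -1.0
    for i, sheet in enumerate(responses):
        others = responses[:i] + responses[i + 1:]
        v = key_variance(others, sheet)
        if v > best_var:
            best_index, best_var = i, v
    return best_index, best_var


# Hypothetical answer sheets (1 = True, 0 = False), invented for illustration.
RESPONSES = [
    (1, 1, 0, 1),
    (1, 1, 0, 0),
    (0, 0, 1, 0),
    (0, 0, 1, 1),
    (1, 0, 0, 1),
    (0, 1, 1, 0),
]

optimal_key = best_key_exhaustive(RESPONSES)
respondent, variance = best_respondent_key(RESPONSES)
```

A key and its complete reversal always yield the same variance, which is consistent with the finding that the selected individuals fell into two mutually negative clusters. With 62 items the exhaustive search is out of the question, which is why selected individuals were used to approximate the optimal key.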

The G scale had a number of interesting properties which were not 
interpretable at the time of its derivation. It showed a very large vari- 
ability, both in absolute terms and as indicated by a coefficient of varia- 
tion. The scores among normals ranged from those who answered none 
of the items in the scored direction, to those who answered all but eight of 
the 62 items in the scored direction—a phenomenon unheard of in the 
other MMPI scales. The odd-even reliability of G was about .93, which 
is considerably higher than the coefficients we typically find in the MMPI 
scales. The item content was that of the typical “neurotic” or “malad-
justment” sort which predominates on a priori scales such as the Thur-
stone or Bernreuter B1-N. Examples of items are: “When in a group of
people I have trouble thinking of the right things to talk about” (T);
“I cry easily” (T); “I am certainly lacking in self confidence” (T). It is
perhaps significant that the most powerful single item in the internal 
consistency sense, which happens in the sample studied to have a cor-
relation of 1.00 with the entire G-scale, is almost a distilled essence or
prototype of so-called “neurotic schedule” items: “I am easily embar- 
rassed” (T). The G scale, although derived without recourse to any 
clinical group whatever, nevertheless showed a correlation of .91 with Pt. 
The mean MMPI curves for unselected normals with high G (the “neu-
rotic” end) showed elevations on F, Hs, D, Pd, Pa, Pt, Sc, and Ma,
especially on Pt and Sc; whereas L (raw score) and Hy tended to fall 
below the mean. The mean profile for normals with low G was almost 
an exact mirror image of this curve. However, G was not found to be 
very effective in the detection of any clinical group or to be particularly 
useful for any purpose; and since at that time no theoretical basis was 
available for interpreting it, the scale was abandoned. Another scale, 
called + (“plus”), was derived in a similar but not identical manner.

In the derivation of the original hypochondriasis key, there was 
developed a correction scale called Ch, the function of which was to 
separate actual clinical hypochondriacs from a group of non-hypochon- 
driacal abnormals (mostly schizophrenic and depressed) who attained 
spuriously elevated scores on H. The item content of this Ch key was 
quite puzzling, because although the correction was successful, the items 
did not seem to refer to anything either hypochondriacal or anti-hypo-
chondriacal. In fact it was difficult to see what psychological homo-
geneity, if any, they possessed. For a more detailed description of this
scale (now no longer in use since the appearance of the modified Hs key) 
the reader is referred to the original article (42). For present purposes 
it is merely necessary to state that the great majority of the items on Ch 
were scored if answered in the statistically rare and obviously “malad-
justed” direction and that they apparently measured some non-somatic
component of test responses which resulted in spuriously elevated H 
scores in persons who were not actually hypochondriacal. 

Still another scale of the same general sort was derived by Meehl 
and called N. To briefly repeat what has been said above, this scale 
differentiated normals showing elevated profiles from clinical abnormals 
showing no greater profile elevations, and was interpreted as detecting a 
plus-getting test attitude for which scores on the personality components 
proper should be corrected. The type of item occurring on the scale N 
has been discussed above. 

Lastly, we recall to mind the Hy-O items which have been described 
above as reflecting this kind of component, although scored in the op- 
posite direction from N, Ch, and G. 

It is of considerable interest to examine the correlations between K 
and these other variables, derived in their diverse ways. Table 2 pre- 
sents the correlations between K and the various scales thought to be 
loaded with the factor in question, based upon scores of 100 individuals 
ages 26-45 in each of the groups indicated. 


                              Table 2

  Correlations of K Scale with Other Variables Thought to be Loaded
                        with the “K-factor”

                          +       G       N      Ch     Hy-O

  Normal males          -.64    -.76    -.70    -.67    .81
  Normal females        -.62    -.73    -.64    -.63    .78
  Male abnormals        -.70    -.75    -.69    -.64    .74
  Female abnormals      -.70    -.81    -.72    -.71    .74








Considering the relative unreliability of some of these variables, the 
above is a very impressive group of intercorrelations. We have two 
scales (G and +) which were derived wholly by internal item relation- 
ships and without regard to criteria of any non-test behavior; a scale 
(N) which corrects for the self-criticality of certain plus-getters who show 
deviant profiles; a scale (Ch) which differentiates hypochondriacs from 
non-hypochondriacal abnormals who have elevated H scores; and a subset
of items (Hy-O) which were chosen because they differentiate a
clinical group (hysteria). There is, however, a considerable item overlap
among these scales, tending to raise these correlations. On the other 
hand, it will be recalled that the scale K is not actually “pure” for the 
hypothetical test-taking attitude because it is a composite of the test-taking
scale L6 plus the eight “psychotic” items. This would presumably tend to
lower the correlations. Accordingly, we have substituted L6 for K,
removed the item overlap among the scales G, N, Ch, L6, and
Hy-O, and calculated correlations among these reduced keys. Table 3 
shows the intercorrelations among these five non-overlapping keys, 
based upon the responses of 150 unselected normal males between the 
ages of 26 and 45, rejecting records with ? > 70 or F > 80. All 
scales were scored so as to render the correlations positive. 


                              Table 3

Intercorrelations of Five Scales Thought to be Loaded with the Test-taking Attitude,
No Item Overlap. N = 150 Normal Males

               G       Ch      L6      N

  Ch          .82
  L6          .76     .71
  N           .78     .73     .66
  Hy-O        .70     .63     .70     .59





This correlational matrix has been subjected to a factor analysis, re- 
peated three times in successively approximating the communalities be- 
cause of the small number of tests. The first factor extracted leaves no 
residuals larger than .049, and the SD of the residuals is .032, which is 
less than the SE of .041 attached to the mean r in the matrix. Testing the 
significance of the residuals by the formula $\chi^2 = \sum (z_1 - z_2)^2 (n - 3)$
(12, p. 339), the chi-square on the deviation of observed r’s from
those predicted with the first factor loading was not significant
(chi-square = 5.101, 5 d.f., P > .30). It appears that one common factor
is quite sufficient to account for the intercorrelations of these scales. 
The factor loadings of the scales G, Ch, L6, N, and Hy-O are .927, .868,
.847, .818, and .770 respectively. It is interesting to find such a powerful 
factor running through scales derived by such diverse methods. It is 
also worth noticing that the largest loading of the K-factor is in the one 
scale constructed wholly by “internal consistency” methods, whereas the 
smallest loading is that of the clinical variable Hy-O. If we extract a 
second factor just to see what it looks like, none of the loadings is over .20 
and the meaning of the second factor would be quite uninterpretable on 
our data. Although we have been thinking in terms of a “K-factor” on
the basis of the apparent community of practical function shown by these
various scales, it is reassuring to find that the term “factor” may be used 
here without doing violence to the more technical meaning of that term 
as used by factor analysts. 
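
The single-factor claim can be checked directly from the figures quoted in the text. The following Python sketch is an editorial illustration, not the authors' computation. It assumes the Table 3 entries as read here, takes the five reported loadings as given rather than re-extracting the factor, and uses hypothetical variable names throughout. Each correlation is reproduced as the product of two loadings, and the Fisher z formula cited above is then applied to the observed and reproduced values.

```python
import math

# Table 3 as read from the text: intercorrelations of the five
# non-overlapping keys (N = 150 normal males).
OBSERVED = {
    ("G", "Ch"): 0.82, ("G", "L6"): 0.76, ("G", "N"): 0.78, ("G", "Hy-O"): 0.70,
    ("Ch", "L6"): 0.71, ("Ch", "N"): 0.73, ("Ch", "Hy-O"): 0.63,
    ("L6", "N"): 0.66, ("L6", "Hy-O"): 0.70,
    ("N", "Hy-O"): 0.59,
}
# First-factor loadings as reported in the text (rounded to three places).
LOADINGS = {"G": 0.927, "Ch": 0.868, "L6": 0.847, "N": 0.818, "Hy-O": 0.770}
N_SUBJECTS = 150


def single_factor_fit(observed, loadings, n):
    """Return residuals (observed r minus product of loadings) and the
    chi-square computed as the sum of (z1 - z2)^2 * (n - 3), where z1 and z2
    are Fisher z transforms of the observed and reproduced correlations."""
    residuals = {}
    chi_square = 0.0
    for (a, b), r_obs in observed.items():
        r_pred = loadings[a] * loadings[b]
        residuals[(a, b)] = r_obs - r_pred
        chi_square += (math.atanh(r_obs) - math.atanh(r_pred)) ** 2 * (n - 3)
    return residuals, chi_square


residuals, chi_square = single_factor_fit(OBSERVED, LOADINGS, N_SUBJECTS)
largest_residual = max(abs(v) for v in residuals.values())

# Standard error of r evaluated at the mean correlation of the matrix.
mean_r = sum(OBSERVED.values()) / len(OBSERVED)
se_mean_r = (1 - mean_r ** 2) / math.sqrt(N_SUBJECTS)

# Ten correlations less five fitted loadings.
degrees_of_freedom = len(OBSERVED) - len(LOADINGS)

# Tabled chi-square exceeded with probability .30 at 5 d.f.
CHI_SQUARE_P30_5DF = 6.064

print(f"largest residual = {largest_residual:.3f}")
print(f"SE of mean r     = {se_mean_r:.3f}")
print(f"chi-square       = {chi_square:.2f} on {degrees_of_freedom} d.f.")
print(f"below P = .30 critical value: {chi_square < CHI_SQUARE_P30_5DF}")
```

Because the loadings are rounded, the chi-square obtained this way should be expected to land near, not exactly on, the published 5.101.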

Considering the nature of the items which are involved in scales such 
as L6, N, and G, this finding perhaps sheds some light on the relative inadequacy
of “neurotic” inventories such as the B1-N when applied to
clinically diagnosed neurotics. Here we have a kind of item which, while 
it does not (in its own right) appear to discriminate normal from abnormal 
individuals very successfully, does reflect some kind of a test-attitude or 
self-critical component. Those “neurotic” persons who happen to be 
characterized by this particular manifestation of self-criticism, such as 
certain compulsives, will probably be differentiated by such a set of items. 
On the other hand, other equally “neurotic” persons such as hysterics, 
who are characterized by the opposite attitude, will not be successfully 
spotted by the scale. If anything, they should be discriminated back- 
ward! Furthermore, the central tendency of abnormals in general is the 
same as that of normals, and it is quite possible that in developing per- 
sonality questionnaires set up in the traditional, a priori fashion and 
“refined” by statistical manipulation we are merely setting up sets of 
items to differentiate among people with respect to various test-attitude 
continua of little or no psychiatric relevance. It will be recalled that the 
scale G consisted of items having the heaviest loading with whatever 
factor (or factors) contribute most to the variance and covariance of the 
entire 550 items in the MMPI pool. Yet this scale turns out to have little 
or no clinical value (except as a suppressor) and to be the scale most 
saturated with respect to the test-taking attitude. We feel that psychol- 
ogists have tended to forget the fact that when one constructs a person- 
ality inventory by studying the item-associations, whether by old- 
fashioned methods of internal consistency or by factor analysis of item 
correlations, he is merely locating certain covariations in verbal behavior. 
When a final scale based upon that kind of derivation is presented to the 
clinician, all that the clinician can be assured of is that persons who say 
certain things about themselves also have a tendency to say certain other things 
about themselves. 

Willoughby’s argument (65) that the non-chance covariation of item 
responses establishes “validity” with respect to some underlying, common
trait which gives rise to the covariation may be admitted without con- 
tradicting what we have just said. That items should exhibit consistency 
in this covariant sense in spite of not being valid for the traits sought, or 
in fact even being negatively valid, has been shown by many studies, 
most particularly those of Landis and his associates (34, 35, 52). The
“underlying disposition” which leads a subject to respond in a certain
way to such questions may or may not be identical with the dispositions 
we recognize as clinical variables, nor with those that might be suggested 
by the item content. It is quite clear on present evidence that this 
identification cannot be established by an assumed equivalence between 
non-test behavior and the verbal report. Hence, as has been repeatedly 
stressed by the present writers, both a priori selection of items and the 
psychological naming of a statistically homogeneous scale from its item 
content are fraught with possibilities of error. 

An obvious line of investigation which is suggested by these con- 
siderations is the systematic study of the relationships which exist among 
variables such as K, G, and N which are fairly definitely known to be 
chiefly test-taking variables, and other personality scales which have 
been developed by variants of the method of internal consistency. Be- 
cause of the influence of socio-economic or educational level upon the 
K-factor (see Section VI below) such studies should ideally be carried out 
upon subjects from the general population. At present, we can only 
report a few preliminary studies which seem to have some bearing upon 
this question. All of these studies happen to be concerned with the 
batteries developed by Guilford and Martin (GAMIN, STDCR, and the 
Personnel Inventory). We wish to emphasize that the presentation of 
these scattered data on our part is intended simply to raise some questions 
concerning the construction of scales by internal consistency methods 
where factors such as K are probably in operation; the validity of the
Guilford-Martin scales must of course be assessed upon other grounds. 
We wish further to stress that in comparing these tests with MMPI we do 
not intend to set the latter up as a “criterion,” although it does of course 
have the advantage that each item is known to differentiate certain 
defined criterion groups which literally define the scales on which the item 
occurs. It should also be made clear that Guilford, as one of the foremost
contributors to the factor analytic approach to personality test
construction, has explicitly called attention to the importance of the 
problem of test-taking attitudes as “factors,” when he says,


“We must constantly remember that the response of a subject may not
represent completely what the question implies in its most obvious meaning.
Subjects respond to a question as at the moment they think they are, with
perhaps a lack of insight in many cases as to their real position on the question.
They also respond as they would like themselves to be and as they would like
others to think them to be and as they wish the examiner to think them to be.
They also respond with some [...] to self-consistency among their own
answers. Whether these determining factors are sufficiently constant to set
up individual differences which are uniform in character and so constitute
common factors in themselves is difficult to say. Should any one of them be
[...] it should introduce an additional vector in the factor analysis”
(18, p. [...]).







It is our opinion that the data we have presented indicate that the 
answer to Guilford’s question is in the affirmative, and that the inclusion 
of a few K-type scales in a factor analysis would probably result in a 
somewhat different interpretation of the other tests and factors than 
would otherwise be the case. 

Wesley (64) has studied the relationships existing between the Guil- 
ford-Martin Personnel Inventory of traits O-Ag-Co and the MMPI 
scales, based upon the test records of 110 presumably normal college 
women. The three traits measured by the Personnel Inventory are 
called objectivity, agreeableness, and cooperativeness by their authors. 
High scores are in the direction of the traits named, and low scores indi- 
cate the presence of what is called in composite the “paranoid” per- 
sonality. Wesley found that the composite Personnel Inventory score 
correlated only .11 with the MMPI Pa scale which, while still in a pre- 
liminary stage, does consist of items which are empirically known to 
distinguish clearly paranoid groups of persons from people in general. 
Together with this rather disconcerting finding, she also discovered that 
the “paranoid” score on the Personnel Inventory correlated .50 and .57 
with the MMPI scales Pt and Sc—both of which are relatively weak 
scales from the standpoint of clinical differentiation but are known to be
heavily loaded with the K-factor. The correlations of “objectivity” with
Pt and Sc were both -.62, which led her to correlate Trait O with the
correction scale N, leading to the same figure. None of the other correlations
of the Guilford scales with MMPI scales exceeded .45, and the
majority of them were under .20. The mean MMPI profile of subjects
selected on the basis of having low raw scores on N (the “defensive” end)
showed a pattern hardly distinguishable from that of subjects selected
for having high scores on Factor O. It is interesting to note in passing
that of the seven items of very similar wording which occur on both the
Guilford-Martin Inventory and the MMPI Pa scale, five are scored as
“paranoid” in the opposite direction on the two scales. For example,
to say that most people inwardly dislike putting themselves out to help
others, that most people would tell a lie to get ahead, that some people
are so bossy and domineering that one feels like doing the opposite of
what they tell him to do, are responses scored as paranoid on the
Guilford-Martin; whereas it is found empirically that these verbal reactions
are actually significantly less common among clinically paranoid persons
than they are among people generally. This kind of finding suggests that
paranoid deviates are characterized by a tendency to give two sorts of
responses, one of which is obviously paranoid, the other “obviously” not.
But these two sorts of responses are negatively correlated among people
generally, and hence appear scored oppositely on scales developed by
internal consistency methods.

It is of course possible to begin the development of scales by internal 
consistency or item-intercorrelation procedures, and having built a scale 
by these methods, to apply it to various criterion groups for validation. 
But it would seem that if the aim is to find items which will optimally 
perform such a discriminating function, the most direct route to that goal 
is immediate empirical item selection from the start. It may be agreed 
that scales developed through item-correlation techniques have more 
statistical “purity” and hence are in a certain special sense better for 
what they do measure. One’s attitude toward this problem is likely to 
reflect his more fundamental views as to the nature of a so-called “measurement”
in personality testing, complete discussion of which would take us
beyond the present paper. It seems clear that the results of factor analy- 
sis to date have not, whatever their theoretical validity, made possible the 
construction of single personality items which can be called even ap- 
proximately “pure.” For example, in Guilford’s factor analysis of 89 
personality items originally chosen (on the basis of suggestions from a 
previous factor analysis) to sample seclusiveness, thinking introversion 
and rhathymia, after the extraction of nine different factors the majority 
of the items still showed communalities less than .50. Torrens (61), 
Wesley (64), and Loth (36) all found that the typical scale intercorrelation 
among the variables of the Guilford-Martin batteries STDCR, GAMIN, 
and the Personnel Inventory is actually higher than the typical intercorrelations
of scales on MMPI which were developed with almost no consideration
for questions of scale purity or freedom from item overlap.

Louis Wesley (personal communication) has suggested that the contrast
between the two methods of scale derivation is between maximal
measurement and meaningful measurement. By this is meant that internal
consistency methods lead to scales which measure whatever they measure
with high consistency, large variance, great discrimination. This is
“maximal” measurement. It is suggested that the most important non-test
behaviors, which it is the aim of the test to predict, may not be associated
with the same variables which lead to the kind of consistency
involved. We may, as in the case of the Pa scale, have to sacrifice the 
desire to have high item intercorrelations in order to score items so as to 
achieve the more fundamental aim of criterion discrimination. Since 
scales are so very “impure” at best, there does not seem to be any very 
cogent reason for sacrificing anything in pursuit of the rather illusory 
purity involved. 

There are multiple determiners which enter into a subject’s decision 
when he answers a personality item. One might say that all but a very
few personality items have an inherently “multiphasic” character, exceptions
being such items as “I am a male.” Obviously, if there existed
or could be invented verbal items which were even approximately pure, 
the “scales” of such items could be extremely short and in fact the practi- 
cal value of substituting an inventory for a few brief oral questions would 
be much in doubt. But the items are not uniquely determined. This 
simple behavioral fact imposes certain limitations upon the progress of 
personality measurement, as has been pointed out by many critics. 
From the common sense point of view, the situation is not very different 
from what occurs in medical diagnosis or in the psychiatric interview. 
Almost all of the symptoms or responses which are in evidence are known 
to arise upon diverse bases. During a psychological interview, a woman 
may miscall her husband by the name of a former suitor, a phenomenon 
which is in itself ambiguous; perhaps she has recently seen the man in 
question, perhaps she has been reading a novel in which that name ap- 
pears, and perhaps—the psychiatrically significant possibility—she feels 
somewhat regretful for not having married him instead. Later, we find 
that she developed a headache on her wedding anniversary, also an 
ambiguous datum if it stands alone. Again, she is excessively effusive 
about how happy her married life is, and so on. It is through the hypothesis
of marital dissatisfaction that these different behaviors find a
common explanation. When we accumulate such single items about her 
behavior, we are merely piling up the probabilities. It seems a little 
foolish to locate these behavior particles or their “sum” on a continuum
of measurement, except in the most crude ordinal and probability sense. 
It is further quite likely that important configurational properties are 
also involved here, so that the significance to be assigned to one of these 
single facts should be a function of the other facts we know. The 
traditional scoring procedure of simply counting how many responses 
belonging to a certain class have been made seems to be very crude; 
fortunately it has been repeatedly found that the various weightings, 
compositions, and non-linear refinements which the behavioristic logic 
might suggest do not usually make sufficient practical difference in the 
ordering and sorting of people to be worth doing. The fact that we find 
it convenient to treat these behaviors in certain mathematical ways 
(independent scoring, unit weights, summation, linear transformations, 
etc.) should not mislead us into supposing that we are doing anything 
very close to what the physicist does when he cumulates centimeters. 
From this point of view, methods aimed at either “purity” or “internal
consistency” are not easy to justify. At the very best, we have a rather 
heterogeneous collection of verbal responses which have a rough tendency 
to covary in strength. It may or may not be true that the most important
(powerful) determiners of this tendency to covary are clinically
relevant or personologically significant. For example, disliking one’s 
husband is not the most powerful “factor” in determining the frequency 
of headaches, among people generally. Nor is it the most potent factor 
in determining whether one calls him by the wrong name. Furthermore, 
the tendency to do these two things may not be covariant at all among 
people in general. None of these reasons, however, would lead us to 
reject the two facts in trying to evaluate the hypothesis of marital un- 
happiness. 

From both the logical and statistical points of view, the best set of 
behavior data from which to predict a criterion is the set of data which 
are among themselves not correlated. This is well known and made use 
of in the combination of scales into batteries; but for some reason psy- 
chologists are uncomfortable if the same reasoning is applied within scales. 
The statistical considerations are of course quite general, applying as 
well to items as to scales. It is likely that the insistence upon high 
internal consistency and “item validity” in the item-test correlation sense
springs in part from a feeling that all of the items ought to be “doing the 
same thing.” This certainly sounds like a reasonable demand as it 
stands, but it requires clarification. As is clear from the factor analysis 
studies, one simply cannot find any appreciable number of non-identical 
verbal items which all “do the same thing.” Every one of them depends
upon many things, and the item as a unit is like the old-fashioned atom— 
uncuttable and hence permanently impure. Items “do the same thing”
when they are so combined in pools that it is very unlikely that the sub- 
ject will answer many of them in the scored direction unless he is char- 
acterized by a certain strength or range of non-test behaviors which in 
turn depend upon the one (or few) “variables” that are common to the 
items. It may still (unfortunately) be the case that the heaviest contri- 
bution to each item consists of variables other than the ones we are inter- 
ested in. That this is in fact true is indicated by the typical values of 
item communalities. 
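
The statistical point about uncorrelated predictors can be made concrete with the standard expression for the correlation between an unweighted sum of k standardized items and a criterion, when every item has the same validity and every pair of items the same intercorrelation. The sketch below is an editorial illustration under exactly those simplifying assumptions. The function name and all numerical values are hypothetical and are not taken from MMPI data.

```python
import math


def composite_validity(k, r_item_criterion, r_inter_item):
    """Correlation of a unit-weighted sum of k standardized items with a
    criterion, assuming equal item validities and equal item intercorrelations.

    Covariance of the sum with the criterion is k * r_item_criterion; the
    variance of the sum is k + k * (k - 1) * r_inter_item.
    """
    return k * r_item_criterion / math.sqrt(k + k * (k - 1) * r_inter_item)


# Ten items, each correlating .20 with the criterion (hypothetical figures).
K_ITEMS = 10
ITEM_VALIDITY = 0.20

for inter_item_r in (0.0, 0.1, 0.3, 0.5):
    validity = composite_validity(K_ITEMS, ITEM_VALIDITY, inter_item_r)
    print(f"mean inter-item r = {inter_item_r:.1f} -> scale validity = {validity:.3f}")
```

With item validity held fixed, the validity of the total score falls steadily as the items become more highly intercorrelated, which is the sense in which high internal consistency works against prediction of the criterion.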

It is this state of affairs which we believe imposes limitations upon the 
efficiency of such suppressor scales as K. Since we cannot find items 
which depend upon only clinical abnormality, we try to find items which 
depend upon abnormality to an appreciable extent even though they 
unavoidably depend upon other things as well. The suppressor consists 
of items which unavoidably depend to some slight degree upon clinical 
abnormality, but to a greater extent upon the objectionable factors in the 
first set. By cumulating responses to the second set of items, we hope 
to get an indication of the strength of these other factors, which information
is then used to correct for their undesired contribution to a score
attained on the first. The impurity of the suppressor itself, however,
sets limits to the efficiency of such a process. Thus, a subject may obtain 
a high depression score because he is a plus-getter. The strength of his 
plus-getting tendency is assessed by items such as those of K. However, 
a sufficiently great degree of depression will yield considerable deviations 
on K, since the K items themselves are not pure for the plus-getting 
tendency but are also slightly loaded with clinical abnormality. In such 
cases K operates against us. It is interesting to note that the K scale,
itself a suppressor, also contains a suppressor in the form of the eight 
“psychotic” items—but here also the effort to suppress the unwanted com- 
ponents of the suppressor can only be imperfectly carried out. No re- 
finements of statistical technique enable us to escape the basic psycho- 
logical fact that our smallest behavior units, the responses made to single 
items, are inherently of this multiphasic character. 
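
The limitation described here can be illustrated with the ordinary two-predictor multiple correlation. In the sketch below, y stands for the clinical criterion, x for a clinical scale contaminated by test-taking attitude, and z for the suppressor. This is an editorial illustration only: the correlations are hypothetical round figures chosen to show the effect, not estimates for K or for any MMPI scale.

```python
import math


def multiple_r(r_yx, r_yz, r_xz):
    """Multiple correlation of criterion y with two predictors x and z,
    computed from the three pairwise correlations."""
    r_squared = (r_yx ** 2 + r_yz ** 2 - 2 * r_yx * r_yz * r_xz) / (1 - r_xz ** 2)
    return math.sqrt(r_squared)


# Hypothetical values: the clinical scale correlates .40 with the criterion
# and shares a good deal of test-attitude variance with the suppressor.
SCALE_VALIDITY = 0.40
SCALE_SUPPRESSOR_R = 0.60

# A "pure" suppressor is unrelated to the criterion itself.
pure = multiple_r(SCALE_VALIDITY, 0.0, SCALE_SUPPRESSOR_R)

# An "impure" suppressor is slightly loaded with clinical abnormality,
# so it correlates with the criterion in the same direction as the scale.
impure = multiple_r(SCALE_VALIDITY, 0.20, SCALE_SUPPRESSOR_R)

print(f"scale alone:                 {SCALE_VALIDITY:.3f}")
print(f"scale + pure suppressor:     {pure:.3f}")
print(f"scale + impure suppressor:   {impure:.3f}")
```

Under these assumed figures a pure suppressor raises the validity from .40 to .50, whereas a suppressor carrying even a modest loading on the criterion gives back almost all of that gain.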


VI. Relation of K to Age, Intelligence, and Socio-Economic Status 


In the study of the correction scale N it had been observed that college 
students (actually, high school graduates tested at the University Coun- 
seling Bureau prior to actual matriculation) showed a distinct elevation 
in the “lie” direction, averaging about one sigma above the general 
population mean. It was also found that the younger age group (16-25) 
showed a similar although smaller deviation, which was accounted for by 
the presence of a considerable number of medical students in that group. 
Furthermore, college graduates who had been some ten years out of college 
showed a mean T-score of about 60 on the N-scale. A similar trend is 
discernible in the case of K. The mean T-score of a group of 84 medical 
students is at 62, a deviation which is significant at the 1 per cent level.
Both male and female pre-college cases average a T of 57 on K. This 
tendency falls in line with the fact that the mean MMPI curve for several 
college and pre-college groups, including some obtained elsewhere than 
at Minnesota, is a curve with a slight but consistent elevation on Hy, in 
spite of having an Hs below the mean. This indicates, as usual, a ten- 
dency to respond in the hysteroid fashion which elevates Hy-subtle 
enough to more than counteract the tendency to answer the somatic 
items on Hy in a non-hypochondriacal fashion. We are not prepared on 
present evidence to give an interpretation of this phenomenon. That it 
is not primarily a reflection of intelligence differences is suggested by a 
correlation of only .04 between K and ACE score among the pre-college 
cases, which, even taking their relative homogeneity into account, should 
be higher if intellect as such is the reason for the difference. If the factor 
at work here is not intelligence, nor the mere fact of being in college when 
tested, two other possibilities are socio-economic status and chronological
age. A group of W.P.A. workers in the young age group 16-25 showed no
elevation on K whatsoever, which would favor the socio-economic inter- 
pretation. The mean K of a group of 50 normals aged 16-25, excluding 
college graduates and persons in college, was 13.5 (T = 52). These
figures would seem to eliminate mere chronological age as the chief basis 
of differentiation. We are left with socio-economic status as the most 
plausible remaining variable. What is needed is study of a group of 
persons in the upper socio-economic group who are not college students 
and have never been college educated. Unfortunately, we do not have 
a large enough sample of such persons to enable us to draw conclusions 
with certainty. The mean raw score on K for a group of 18 normal sub- 
jects classified in Groups I and II in the Goodenough classification, who 
were not, however, college graduates or attending college, was 18.50, 
which corresponds to a T of 61. In spite of the small N, this difference 
is great enough so that a t comparing their mean with that of 156 unselected
normals from the other economic classes was highly significant
(t = 6.055, P < .01). It seems plausible that the college, pre-college
and college-educated elevation is reflecting chiefly a difference in socio- 
economic status, although further evidence on this topic should be col- 
lected. If this is confirmed by subsequent investigation, it will be inter- 
esting to speculate upon the possible ways in which membership in the 
upper classes generates the particular kind of defensiveness involved. 
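
The T-scores quoted in this section are linear transformations of raw scores with a mean of 50 and a standard deviation of 10. The following sketch, added editorially, shows the conversion. The normative mean and standard deviation of K are not stated in this section, so the values used are back-solved from the two raw and T-score pairs reported above (13.5 with T = 52, and 18.50 with T = 61). They inherit the rounding of the published T-scores, ignore any difference in norms between the sexes, and should be treated as hypothetical.

```python
def linear_t_score(raw, norm_mean, norm_sd):
    """Linear T-score: 50 plus 10 times the standard score of the raw value."""
    return 50 + 10 * (raw - norm_mean) / norm_sd


# Two (raw score, T-score) pairs reported in the text.
RAW_A, T_A = 13.5, 52
RAW_B, T_B = 18.5, 61

# Back-solve a mean and SD consistent with both pairs (an assumption made
# for illustration; the published norms are not given in this section).
K_SD = 10 * (RAW_B - RAW_A) / (T_B - T_A)
K_MEAN = RAW_A - (T_A - 50) / 10 * K_SD

print(f"implied K mean = {K_MEAN:.2f}, implied K SD = {K_SD:.2f}")
for raw in (RAW_A, RAW_B):
    print(f"raw {raw:5.2f} -> T = {linear_t_score(raw, K_MEAN, K_SD):.0f}")
```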


VII. Summary and Conclusions 


The general problem of test-taking attitudes in their effect upon 
scores obtained on structured personality inventories is discussed. The 
literature on the subject is briefly surveyed, and a discussion given of the 
various approaches which have been taken in an effort to solve this prob- 
lem. The final result of many efforts to derive special scales for measur- 
ing various attitudes in the taking of the Minnesota Multiphasic Inven- 
tory is presented, with some indication of its validity. The relationship 
of this scale, called K, to other variables is used as a basis for discussing 
certain general problems in the theory of personality measurement. 
Conclusions are as follows: 


1. The conscious or unconscious tendency of subjects to present a 
certain picture of themselves in taking a personality inventory has a 
considerable influence upon their scores. 

2. We may distinguish two directions in this test-taking attitude: the 
tendency to be defensive or to put oneself in a too favorable light, and the 
opposed tendency to be overly honest and self-critical (plus-getting). 
The extremes of these tendencies are deliberate, conscious efforts to fake 
bad or lie good.

3. The defensive tendency appears to be related to the clinical
picture of hysteria, whereas plus-getting is related to the picture of psychasthenia.

4. The MMPI scales L and F, while relatively effective in detecting 
extreme distortion, do not seem to be sufficiently subtle to detect the more 
common and often unconscious varieties of defensiveness or plus-getting. 
It has been found convenient to begin interpretation of L in the range of 
T-scores 55 or 60; whereas F does not clearly establish invalidity even up 
to T-score 80 (raw score about 16). 

5. By contrasting item frequencies of abnormal persons showing 
normal MMPI profiles and elevated L scores, with the records of un- 
selected normals, an empirical key called K has been derived which is 
relatively successful in detecting the influence of disturbing test-taking 
attitudes and can be used to improve the discrimination between normals 
and abnormals. 

6. In studying the intercorrelations among a group of scales derived 
by various means but all functioning with some effectiveness to detect 
such attitudes, it was found that one common factor is sufficient to ac- 
count for all of the intercorrelations. The scale (G) which has the 
largest factor loading was derived by a method of internal consistency 
and without recourse to any external criterion. Since K is the scale 
being used to measure this factor, the factor in question has been called 
K-factor. 

7. On the basis of these findings and study of the relationship of
MMPI to certain of the Guilford-Martin scales, it is suggested that 
perhaps the construction of personality inventories by means of item- 
correlation and factor analytic methods leads to the development of 
tests which are excessively loaded with such test-taking attitudes. The 
procedure of internal consistency in its various forms is called into 
question as a profitable method for the construction of personality in- 
ventories. 


Received July 9, 1946.


Rogers, Carl R., and Wallen, J. L. Counseling with returned servicemen.
New York: McGraw-Hill Book Co., 1946. Pp. 159. $1.60.


This clear, concise little book is intended to be a manual for the train- 
ing of counselors, particularly that large group of men and women who, 
during the war and its aftermath, have become counselors by virtue of
the needs of others rather than as a result of a long-term plan and training 
program of their own. 

Its scope is not, however, as broad as its authors claim. It is not a 
manual of counseling, but rather a manual of non-directive counseling. 
As such it is an admirable primer. It begins with a brief discussion of 
the nature of non-directive counseling, of aspects of the psychology of 
adjustment most important to that type of therapy, and of the attitude of 
the non-directive counselor. This introductory material is followed by a 
description, illustrated by well-selected excerpts from interviews and 
case summaries, of the methods and processes of non-directive counseling. 
Two chapters are devoted to vocational and educational counseling and to 
marital counseling, not because the authors consider them special types, 
but because many others do. There is an excellent discussion of the use 
of the casual contact, demonstrating how well it can lend itself to non- 
directive counseling, followed by a series of exercises in responding to 
clients’ statements and questions which are brought together to give some 
preliminary “practice in counseling,” an excellent teaching device.
An appendix of “further reading,” all of which has to do with non- 
directive counseling or the attitudes of returned servicemen, and a good 
index complete the volume. 

Rogers’ viewpoint is already well known to psychologists, and its 
general nature as expounded in collaboration with one of his former 
students in this brief manual needs no comment here. Some points, 
however, have been made more specific in this manual than in Rogers’ 
previous writings, and should be noted because in some cases they 
sharpen our insights into the counseling process, and because in others 
they are, in this reviewer’s opinion, misleading unless modified. 

Seven assumptions or claims are made by Rogers and Wallen, each of 
which should at least be recognized and, in time, validated or rejected as 
false. 

First, is the assumption that most people are maladjusted and need 
psychotherapy (pp. 1, 2, 14, 90, 96, and 104—“usually . . . the statement 
of a vocational or educational problem really disguises a deeper 
personal problem”). There are already some investigations of this 
question, e.g. Bragdon’s college study, not referred to by the authors, who 
implicitly rule out the possibility of anyone being well enough adjusted 
before seeing a counselor for his desire for expert help in making a decision 
to be genuine. Must we undergo psychotherapy each time we want to 
consult a banker about financing a car, an architect about building a 
house, or a school counselor about choosing a college? 

Secondly, it is assumed that maladjustments are primarily the result 
of attitudes (pp. 2, 113), the possibility of situational maladjustments 
being not even considered. Counseling, as conceived of by Rogers and 
Wallen, therefore consists of enabling the client to become aware of his 
attitudes. This is an important type of problem to be able to handle, and 
a type of therapy all counselors should be able to use. But it is surprising 
that these authors are blind to the importance of the environment and to 
the effectiveness of a changed situation in many cases, for those who 
dealt with cases of “operational fatigue” in military hospitals saw many 
“cures” when discharge papers or V-J Day were in sight, which were at 
least as good as those which non-directive counseling might have achieved. 
Rogers’ USO experience apparently did not bring him in contact with 
these cases in a way to give him such insights, and Wallen’s military 
service was only in the training stages of the war when neurotic reactions 
were less common and cures even less so, but neuroses have long been 
known to be a means of getting out of difficult situations and of getting 
compensation. 

The third assumption is a part of the second, namely that acceptance 
of responsibility is more to be sought after than good adjustment. This is 
implicit throughout the book. The authors frequently point out the 
need to leave everything up to the client, including the right to refuse 
counseling (reductio ad absurdum: this is successful counseling, as it is an 
assumption of responsibility by the client). Ability to operate on one’s 
own is certainly a desired outcome of counseling, but there are conditions 
in which it is impossible, e.g., psychosis, childhood delinquency. Rogers 
seems not to recognize that some adjustment directed by the counselor 
may at times need to precede the assumption of responsibility, just as the 
development of appropriate attitudes often needs to precede the use of 
information. 

Fourth, is the claim that the viewpoint of this book is new (pp. 5, 23- 
24). Rogers and Wallen are unfair to Rank, Taft, and other relationship 
therapists, whose views this reviewer, together with many other students, 
studied and tried out at this institution ten years ago, when Rogers was 
doing the same in Rochester prior to writing his own books. To say this 
is not to minimize the great service he has rendered in clearly and concisely 
expounding and illustrating this type of therapy, and in giving the 
work a research basis. 

Fifth, it is claimed that traditional (pre-Rogerian) counseling means 
pointing out the steps the client should take (pp. 5, 6, 91, 95). This is 
unfair to the best professional counselors of almost any school, whether of 
psychoanalysis, vocational guidance, or other. Such counseling has in 
general characterized the relatively inexperienced, untrained, and in- 
secure counselor. Rogers’ approach safeguards these from this type of 
error. His misapprehension is perhaps due to an inadequate understand- 
ing of what interpretation is; his illustrations of interpretation (e.g., p. 
146) are not what this reviewer would classify as such. Interpretation 
is actually non-directive or client-centered rather than counselor-centered, 
if well done; its objective is to hasten insight. But Rogers and Wallen 
do not see this (pp. 89, 142, 146). “Traditional” vocational counseling 
is also misrepresented as accepting the presented problem as the real one 
(p. 94); this is too often true of vocational and educational guidance as 
practiced by ill-trained, inexperienced, or over-burdened school or 
counseling service personnel, but not as practiced by the better vocational 
counselors nor as advocated in the literature. 

Sixth, it is claimed that diagnosis and prescription involve judging in 
terms of oneself, imposing one’s own values (pp. 20, 27). The authors, 
therefore, deny the wisdom of either. But on pp. 103 to 105 they em- 
phasize the need for diagnosis, avoiding the actual term, of the use the 
client is trying to make of the relationship. The reviewer questions the 
claim concerning diagnosis at least, and the authors prescribe the type 
of adjustment they value most: client acceptance of responsibility. 

Seventh, and last, is the apparent assumption that the counselor needs 
no training in the use of diagnostic techniques or in educational, voca- 
tional, or other types of information, even though the value of such 
techniques is admitted for some cases, at certain stages, in the chapter on 
vocational and educational counseling (recognition which seems to be 
forgotten at other points of the discussion). Even the readings in this 
“manual of counseling” include no references to works on testing or oc- 
cupational information. 

The above points are made, not because of a desire to decry the 
significance of this little volume, but because it has such a great contribu- 
tion to make that its limitations need to be clearly pointed out. Rogers 
and Wallen have done an excellent job of explaining and illustrating non- 
directive counseling, clarifying its objectives, methods, and steps. They 
have pointed out how one can assist one person whose problem consists 
of several persons, by working with him alone. They have shown the 
usefulness of the casual contact, and how to use it. They have shown the 
need for adequate diagnosis (to give it its right name) before vocational 
counseling, and one way of doing it. They have written a good discus- 
sion of an old and widely used technique of test interpretation. They 
have made available some excellent instructional aids. The book should 
be widely read, assimilated, and in time re-written in the light of a better 
perspective. 
Donald E. Super 
Department of Guidance, 
Teachers College, Columbia University 


Beaumont, Henry. The psychology of personnel. New York: Longmans, 
Green and Co., 1945, pp. xiii + 306, $2.75. 


The author intended this book to be “a general introduction to the 
contributions which psychology has made and should continue to make 
with ever-increasing success to the solution of the problems of personnel 
management.” The book deals also with many of the non-psychological 
phases of personnel management. 

The real value of the book will be found in the chapters on “Training 
Employees,” “The Workers’ Health,” “Promoting Safety,” “Providing 
Incentives,” and “Occupational Adjustment.” To these subjects the 
author contributes a fresh point of view and supplies excellent examples 
from industrial practice. The author misses several opportunities, 
however, to point out psychological applications. 

The chapters on “Analyzing Jobs,” “Selecting Employees,” “Working 
Conditions,” and “Merit Ratings” merely restate basic concepts and 
do that rather poorly. References to the reliability of tests, for example, 
are superficial and misleading (pages 67 and 71). In the section entitled 
“Selection Ratio” (page 80) the concept is never explained satisfactorily. 
Such inaccuracies combined with a lack of facility in expression detract 
considerably from the value of the book. 

The author maintains that occupational maladjustment may be pre- 
vented by the proper selection, guidance and training of workers and by 
proper labor conditions such as fair standards, conditions and hours of 
work, skilled supervision and effective incentives. In presenting such a 
case, the author is at his best. 

There is much opinion and little proof presented in the book, but this 
is a criticism of the personnel field more than a fault of the author. 

Psychologists and industrial men interested in personnel management 
should read this book because, in spite of its faults, it casts some new light 
on a field that is still rather dark. 


Charles C. Gibbons 
The W. E. Upjohn Institute for Community Research, 
Kalamazoo, Michigan 


Boring, E. G. (Ed.) Psychology for the armed services. Washington: 
The Infantry Journal, 1945. Pp. xvii and 533. $3.00. 


This book aims to outline with a minimum of technical language, but 
geared to college level, what the whole body of psychological knowledge 
holds for the military man. It is offered as a textbook and as a handbook 
of psychology, not simply for instruction but for individual reading and 
reference. The level of presentation is distinctly higher than the com- 
panion volume Psychology for the fighting man, both in presentation of 
principles and in the development of their applications, yet it retains the 
readable qualities of the latter. 

The volume impresses the reviewer as a most commendable achieve- 
ment for several reasons and distinctly to the credit of psychology as a 
science and as a profession. 


1. It is a job done in the war emergency through the collaboration of 
some sixty persons and an editor. It should be a final answer to the ac- 
cusation that psychologists cannot agree among themselves upon any- 
thing but spend their time quarreling over their theories. 

2. It shows that there is in the subject matter of psychology a considerable 
and respectable body of facts which have been distilled out of 
the research of the last fifty years. This is the foundation upon which the 
science will grow. 

3. It proves that these facts are not merely curiosa of the laboratory 
but rather that just about everything with which the psychologist has 
busied himself has practical utility. The teacher can now without 
apology to his students dust off his olfactometers, his aesthesiometers, his 
tachistoscopes, his pseudoscopes and his color wheels, for they have 
earned their right to a place in the applied laboratory. Of all the 
classical experiments the reviewer failed to find use only of warm spots, 
cold spots, touch spots, etc. Perhaps it is there and he just happened to 
overlook it. 


The war is over. But it was a total war and as such it brought within 
the scope of the armed services every civilian activity, no matter how 
specialized. For this reason, the book is as applicable to peace time 
activities as to the emergency of war. The six chapters dealing with 
sensory functions might seem to hold material least useful for peaceful 
living, but planes will still have to take off, fly through difficult weather 
and land. Men will have to communicate with each other under difficult 
conditions of hearing and people will still need to find their way through 
strange territory afoot and awheel. As for efficient methods of working, 
of learning and of teaching one can find ready use for all that the book 
has to tell. The same is true of the accounts of personal adjustment, of 
vocational guidance and selection, of leadership and morale, of opinion 
and the forces that make it. Any civilian will profit from reading these 
chapters. 

Of course, there are things to criticise. Some statements are so 
compact as to be hard to understand, some errors of fact have crept in 
here and there, some items have more space devoted to them than they 
deserve, and shadings of meaning have been sacrificed for the sake of 
brevity. What seems to the reviewer to be the most notable achieve- 
ment, namely, the almost complete avoidance of controversial issues in 
the presentation of material, will irk those specialists of one or another 
point of view which seems to have been disregarded. Where choice had 
to be made, as for instance in the chapters on motivation and morale and 
on sex, the authors did just that. 

As the author of a civilian textbook on applied psychology, the re- 
viewer doffs his hat to the editor of a worthy competitor. 


Albert T. Poffenberger 
Department of Psychology, 
Columbia University 


Nesbitt, Murrough de B. The road to Avalon. Capetown, S. Africa: 
Hodder and Stoughton, 1944. Pp. 226. 

Barton, Betsy. And now to live again. New York: Appleton-Century 
Co., 1944. Pp. 150. 


“Avalon” is the author’s dream of a colony where crippled men and 
women may learn to use their new limbs and regain both physical and 
mental balance. 

The Road to this Avalon is the author’s life story. At thirteen he 
lost both legs. Fourteen operations on his stumps, seven successive 
pairs of artificial legs, dreadful pain, poverty, illness—these are not the 
real story. The real story is learning to walk, to swim, to dive, to ride, 
to sail; the conquest of pain and self-consciousness; the winning back of a 
normal social life. It is no less an inspiring story that it has happened 
many times before. 

And yet I am not so sure of the effect of this book on others of like 
handicap. The author admits that he ran away from life because he was 
afraid of it. And there is a sort of implication that his Wanderlust, his 
inability to settle down to one job or one place for so many years, had in it 
something admirable, instead of being merely a personal idiosyncrasy 
only incidentally connected with his handicap. Surely, however understandable 
it may be in his case or another’s, irresponsibility and restlessness 
are no virtues. 

The author hates pity but I am not sure he does not pity himself. 
And he writes to inspire in us both pity and admiration. There is, to 
my taste, a trifle too much of the Jack Horner about it all,—though 
sitting in a corner he certainly did not. 

In short, it requires superb artistry to write about oneself in such a 
way as to inspire others. Perhaps only scamps can write really good 
autobiographies; saints and heroes inevitably sound a little smug. 

Betsy Barton’s book, likewise, is written out of her own experience 
and suffering. She does not hesitate to tell of her own life in order to 
make a point. But she is always turning from her own life outward. 
The result is a less powerful but more wholesome book. 

Not that it is uninteresting. Simply and plainly written, with many 
human interest stories, she makes the age-old point that the primary 
problem of the handicapped is his attitudes. The unity of the mind and 
body is forcefully put, though in old fashioned terms. If we really be- 
lieve in the unity, we should leave off talk of mind and body and speak 
of the human being, the person. 

This is not just to split hairs about words. As long as we are thinking 
in terms of a mind and a body, we must admit that there are “crippled” 
bodies. But when we think in terms of a person as a going concern,— 
eating, digesting, hoping, breathing, planning for the future, loving,—we 
see that there are many who are indeed handicapped by loss of limb or 
sight or hearing but who as persons are not crippled at all, but are glori- 
ously alive and whole. Many of us, on the contrary, though our members 
seem sound, are, with our petty aches and pains, our anxieties and our 
fears, our jealousies and our hates, and our indigestions, really crippled 
persons. 

Inspirational books are hard to write. They so easily get preachy; 
and preaching is dreadfully in the Bruce Barton family tradition, but 
here is one which carried its point. It isn’t a bad book for a lot of people 
with no visible handicap. And men and women who have been injured 
will find here valuable practical suggestions for learning “to live again.” 


Horace B. English 
Ohio State University 


Gann, E. Reading difficulty and personality. New York: King’s Crown 
Press, 1945. Pp. 149. $2.00. 


In an introductory discussion of reading disabilities and the causative 
factors involved, the writer stresses the view that the whole personality is 
involved in reading behavior and that there is a dynamic relation between 
the reader and the meanings he derives. One should seek, therefore, 
evidences of difficulty in the adjustment of the personality in relation to 
reading. The statement that personality disturbances invariably ac- 
company reading difficulties, however, is not strictly accurate. 

The hypothesis to be tested: “Dynamic processes in the personality 
organization which determine its means or types of adaptations are re- 
lated to, and influential in the reading experience. These processes are 
associated with or may be responsible for the difficulties or retardations.” 
Personality patterns of retarded, average and superior readers were com- 
pared. Personality was measured by the Rorschach Test and an in- 
ventory. The author depends upon published statements that their 
reliability and validity are adequate. Other rating scales and interest 
inventories were devised but were not evaluated for either reliability or 
validity. 

According to the claims of the author, the Rorschach system shows 
that retarded readers, in comparison with average and superior readers, 
(1) are emotionally less well adjusted and less stable, (2) have feelings of 
insecurity, and (3) are socially less adaptable. The retarded reader is 
resistant to reading experience, reflects less interest in and occupation with 
reading, and shows signs of an unfortunate teacher-pupil relationship. 
The retarded reader is seen as “a functioning personality, organized in 
ways which would seem detrimental to efficiency in learning, especially 
with reference to reading.” The author considers that the normally 
adjusted child will learn to read with average success in the ordinary 
school situation. She suggests that personality difficulties of retarded 
readers are not due to lack of success in reading but come from other 
environmental influences. These inhibiting personality forces lead to 
reading disability and resolution of these forces, therefore, should be a 
first step in remedial work. As a matter of fact, there is nothing in the 
data to indicate whether reading disability causes maladjustment or vice 
versa. The author is drawing her conclusions and suggesting implications 
from results which might be revealed in future research and from the bias 
of her own view that reading difficulties and disabilities are part of a 
larger organization, the total personality. 

The reviewer is suspicious of the applicability of the statistical test 
employed to evaluate reliability of differences since it yields statistical 
reliability for microscopic differences. Thus, in comparing the mean 
rating of retarded and average readers for concentration, the difference 
between means of 3.68 and 3.59 with sigmas of .46 and .63 is .09. The 
computed critical ratio is 3.00 which is above the one per cent level of 
significance. When one considers the customary reliability and validity 
of ratings, the difference of .09 does not seem to have practical significance. 
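
The arithmetic behind this objection can be made explicit. The following is a minimal sketch in Python, assuming the conventional critical ratio for two independent means (the difference divided by the standard error of the difference). The review states neither the formula the author used nor the sizes of the groups, so the function name, the variable names, and the equal-group-size assumption are illustrative only.

```python
import math


def critical_ratio(mean_a, mean_b, sd_a, sd_b, n_a, n_b):
    """Difference between two independent means over its standard error.

    ASSUMPTION: independent groups, SE_diff = sqrt(sd_a^2/n_a + sd_b^2/n_b).
    The review does not say which formula the author actually used.
    """
    se_diff = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    return (mean_a - mean_b) / se_diff


# Figures quoted in the review (concentration ratings).
mean_a, mean_b = 3.68, 3.59
sd_a, sd_b = 0.46, 0.63
reported_cr = 3.00

# Standard error of the difference implied by the reported critical ratio.
implied_se = (mean_a - mean_b) / reported_cr

# ASSUMPTION: equal group sizes (not given in the review).
# Solving sqrt((sd_a^2 + sd_b^2) / n) = implied_se for n.
n_per_group = (sd_a ** 2 + sd_b ** 2) / implied_se ** 2

# Hypothetical comparison: the same difference with 50 cases per group.
cr_small_groups = critical_ratio(mean_a, mean_b, sd_a, sd_b, 50, 50)

print(f"implied SE of difference: {implied_se:.3f}")
print(f"cases per group needed (equal n assumed): {n_per_group:.0f}")
print(f"critical ratio with 50 per group: {cr_small_groups:.2f}")
```

Under these assumptions a critical ratio of 3.00 for a difference of .09 implies a standard error of about .03, which would call for roughly 676 cases in each group; with 50 cases in each group the same difference would yield a critical ratio below 1. The sketch shows only how strongly the ratio depends on the standard error; it does not reconstruct the author's computation.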

The author has made a contribution in giving added emphasis to the 
factor of adjustment difficulties in reading disability. But unfortunately 
she has gone far beyond her data in discussing the implications of her 
findings. 

Miles A. Tinker 
University of Minnesota 


New Books, Monographs, and Pamphlets 


Books, monographs, and pamphlets for listing and possible review should be sent to 
Donald G. Paterson, Editor, Department of Psychology, University 
of Minnesota, Minneapolis 14, Minnesota 


New careers in industry. Amiss and Sherman. New York: McGraw- 
Hill Book Co., Inc., 1946. Pp. 227. $2.50. 

Breaking the skilled labor bottleneck. Eugene J. Benge. Connecticut: 
National Foremen’s Institute, Inc., 1942. Pp. 98. $1.00. 

How to make a morale survey. Eugene J. Benge. Connecticut: National 
Foremen’s Institute, Inc., 1941. Pp. 102. $7.50. 

Job evaluation and merit rating. Eugene J. Benge. Connecticut: Na- 
tional Foremen’s Institute, Inc., 1946. Pp. 107. $7.50. 

Your problem—can it be solved? Dwight J. Bradley. New York: The 
Macmillan Co., 1945. Pp. 213. $2.00. 

A chart for the rating of foremen. R.D. Bundy. Connecticut: National 
Foremen’s Institute, Inc., 1945. Pp. 8. $.50. 

Objective and experimental psychiatry. D. Ewen Cameron. New York: 
The Macmillan Co., 1946. Pp. 390. $4.25. 

Personal adjustment. Knight Dunlap. New York: McGraw-Hill Book 
Co., Inc., 1946. 

Statistical analysis. Allen L. Edwards. New York: Rinehart & Co., 
Inc., 1946. Pp. 360. $3.50. 

Guidance practices at work. Erickson and Happ. New York: McGraw- 
Hill Book Co., Inc., 1946. Pp. 325. $3.25. 

Industrial management in transition. George Filipetti. Chicago: Richard 
D. Irwin, Inc., 1946. Pp. 311. $3.75. 

Enrollment increases and changes in the mental level. F. H. Finch. Stanford 
University: Stanford University Press, 1946. Pp. 75. $1.25. 

How to evaluate supervisory jobs. Albert N. Gillett. Connecticut: National 
Foremen’s Institute, 1945. Pp. 90. $7.50. 

Guide to guidance. Volume VIII. M. Eunice Hilton. New York: 
Syracuse University Press, 1946. Pp. 58. $1.00. 

The biology of schizophrenia. Roy G. Hoskins. New York: W. W. Nor- 
ton & Co., Inc., 1946. Pp. 191. $2.75. 

People in quandaries. Wendell Johnson. New York: Harper & Brothers, 
1946. Pp. 532. $3.00. 

Counseling techniques in adult education. Paul E. Klein and Ruth E. 
Moffitt. New York: McGraw-Hill Book Co., Inc., 1946. Pp. 185. 
$2.00. 

How to handle labor grievances. John A. Lapp. Connecticut: National 
Foremen’s Institute, 1946. Pp. 294. $4.00. 

People and books. Henry C. Link and Harry Arthur Hopf. New York: 
Book Industry Committee, Book Manufacturers’ Institute, 1946. 
Pp. 166. $10.00. 

Psychiatry for social workers. Lawson G. Lowrey. New York: Columbia 
University Press, 1946. Pp. 337. $3.50. 

Psychology in industry. Norman R. F. Maier. Boston: Houghton 
Mifflin Co., 1946. Pp. 463. $3.00. 

How to select foremen and supervisors. R. C. Oberdahn. Connecticut: 
National Foremen’s Institute, 1944. Pp. 60. $2.00. 

An introduction to educational statistics. Charles W. Odell. New York: 
Prentice-Hall Inc., 1946. Pp. 270. $3.50. 

Modern clinical psychology. T. W. Richards. New York: McGraw-Hill 
Book Co., Inc., 1946. Pp. 340. $3.50. 

Manual of advisement and guidance. Ira D. Scott. Washington, D. C.: 
Superintendent of Documents, U. S. Government Printing Office, 
1945. Pp. 233. $1.25. 

Sex and the social order. Georgene H. Seward. New York: The Mc- 
Graw-Hill Book Co., Inc., 1946. Pp. 301. $3.50. 

Job evaluation and employee rating. Richard C. Smyth and Matthew J. 
Murphy. New York: McGraw-Hill Book Co., Inc., 1946. Pp. 255. 
$2.75. 

The personnel primer. Charles S. Stevenson. Connecticut: National 
Foremen’s Institute, 1945. Pp. 32. $.25. 

Child psychology for professional workers. Florence M. Teagarden. 
New York: Prentice-Hall, Inc., 1946. Pp. 613. $3.75. 

Living issues in philosophy. Harold H. Titus. New York: American 
Book Co., 1946. Pp. 436. $3.25. 

An international convention against antisemitism. Mark Vishniak. New 
York: Research Institute of the Jewish Labor Committee, 1946. 
Pp. 135. $2.50. 

Proceedings of the 1945 annual conference of the Life Office Management 
Association. New York: Life Office Management Association, 1945. 
Pp. 241. $5.00. 

Ohio State and occupations. The Occupational Opportunity Service. 
Columbus: The Ohio State University Press, 1945. Pp. 198. $1.50. 

Training supervisors in human relations. Policyholders Service Bureau. 
New York: Metropolitan Life Insurance Co., 1946. Pp. 53. Gratis. 

Selection of sales personnel and aptitude testing. The Society for the 
Advancement of Management. New York: Sutton-Malkames, Inc., 
1945. Pp. 137. $4.00. 